PREFACE



This volume presents a series of papers developed by the authors as
part of their deliberations as members on the National Research Council's
Committee of Program Evaluation in Education. As such, the papers
represent only the work of the individual authors and in no way have been
reviewed or endorsed by the Committee, the National Research Council, the
National Academy of Science, and while we're disclaiming, the UCLA Center
for the Study of Evaluation, nor the National Institute of Education.

Yet while officially representing only themselves, the authors by
topic and style provide a compact panoply of present evaluation thinking.
Some authors hold the view, "let the flower bloom," and address their
topic with dispassionate appraisal of alternatives. No less skillful
writers, on the other hand, eschew such even-handedness and inform the
reader of the characteristics of high quality studies.

Some papers are didactic, and attempt to take care to explain
with little ambiguity or abstraction what the critical issues are. 
Others present an enticing range of examples and support for favorite 
theses and summon up work horizontally from other science fields and 
historically from earlier epochs. Lacing the writing so richly heightens 
the reader's perception of the breadth and credibility of the author. 
These various virtues can be found within these efforts. 

Certainly, if never plainly detailed, the conflict between preferences
in evaluation methodology comes through in these papers. The problem is
not the oft contrasted benefits of quantitative versus qualitative
information. Rossi and Berk correctly point out that various methods of
data generation can be used complementarily in both experimentally
controlled and naturally varying designs. Left for future debate, however,
is the avowal of some of the authors that experimental manipulation
represents the evaluation design of choice. Rather casually dismissed
were concerns of political solvency, of program diffusion (or contamination),
conservatism and delay inherent in developmental testing, as these concerns
might specially impact program evaluation choices in education. For myself,
I would have enjoyed an extended discussion of the measurement problems in
any of the designs available to us.



But my intention is to praise the diversity that these views
communicate. One particular pleasure in reading these papers derived
from the evaluation in contexts used by the writers. Rather than arguing
solely as evaluators on the issue of program evaluation, more than one
writer found it necessary to explain the role of evaluation in the larger
theater of educational development. Redirecting evaluators' views from
exclusively research roots and extending their colloquium to include
both research and development should certainly improve the quality,
academic virtue, and practicality of consequent evaluations.



Eva L. Baker 
November, 1981 

CHAPTER 1

AN OVERVIEW OF EVALUATION STRATEGIES AND PROCEDURES

Peter H. Rossi
University of Massachusetts

Richard A. Berk 
University of California, Santa Barbara 

Introduction 

The purpose of this paper is to provide a detailed introduction to
the variety of purposes for which evaluation research may be used and
to the range of methods that are currently employed in the practice of
that field. Specific examples are given wherever appropriate to provide
concrete illustrations of both the goals of evaluation research and
the methods used.

While the coverage of this paper is intended to be comprehensive
in the sense of describing major uses of evaluation research, it cannot
even pretend to be encyclopedic. The reader interested in pursuing any
of the topics discussed in this paper is provided with references to
more detailed discussions. In addition, there are several general
references that survey the field of evaluation in a more detailed fashion
(Suchman, 1967; Weiss, 1972; Cronbach, 1980; Rossi, Freeman & Wright,
1979; Guttentag & Struening, 1976).

Policy Issues and Evaluation Research 

Virtually all evaluation research begins with one or more policy
questions in search of relevant answers. Evaluation research may be
conducted to answer questions that arise during the formulation of policy,
in the design of programs, in the improvement of programs, and in testing
the efficiency and effectiveness of programs that are in place or being
considered. Specific policy questions may be concerned with how
widespread a social problem may be, whether any program can be enacted that
will ameliorate a problem, whether programs are effective, whether a
program is producing enough benefits to justify its cost, and so on.

Given the diversity of policy questions to be answered, it should
not be surprising that there is no single "best way" to proceed and
that evaluation research must draw on a variety of perspectives and
on a pool of varied procedures. Thus, approaches that might be useful
for determining what activities were actually undertaken under some
educational program, for instance, might not be appropriate when the
time comes to determine whether the program was worth the money spent.
Similarly, techniques that may be effective in documenting how a program
is functioning on a day-to-day basis may prove inadequate for the task
of assessing the program's ultimate impact. In other words, the choice
among evaluation methods derives initially from the particular question
posed; appropriate evaluation techniques must be linked explicitly to
each of the policy questions posed. While this point may seem simple
enough, it has been far too often overlooked, often resulting in
force-fits between an evaluator's preferred method and particular questions
at hand. Another result is an evaluation research literature padded with
empty, sectarian debates between warring camps of "true believers". For
example, there has been a long and somewhat tedious controversy about
whether assessments of the impact of social programs are best undertaken
with research designs in which subjects are randomly assigned to
experimental and control groups or through theoretically derived causal
models of how the program works. In fact, the two approaches are
complementary and can be effectively wedded (e.g., Rossi, Berk, & Lenihan,
1980).

To obtain a better understanding of the fit between evaluation
questions and the requisite evaluation procedures, it is useful to
distinguish between two broad evaluation contexts, as follows:

Policy and Program Formation Contexts: Contexts in which policy
questions are being raised about the nature and amount of social
problems, whether appropriate policy actions can be taken, and
whether programs that may be proposed are appropriate and effective.

Existing Policy and Existing Program Contexts: Contexts in which
the issues are whether appropriate policies are being pursued and
whether existing programs are achieving their intended effects.



While these two broad contexts may be regarded as stages in a
progression from the recognition of a policy need to the installation
and testing of programs designed to meet those policy needs, it is
often the case that the unfolding of a program in actuality may bypass
some evaluation activities. For example, Head Start and the Job Corps
were started up with minimum amounts of program testing beforehand:
indeed, the issue of whether Head Start was or was not effective did
not surface until some years after the program had been in place, and the
Job Corps has just recently (a decade and a half after enactment) been
evaluated in a sophisticated way (Mathematica, 1980). Similarly, many
programs apparently never get beyond the testing stage, either by being
shown to be ineffective or troublesome (e.g., contract learning,
Gramlich & Koshel, 1975) or because the policy issues to which they
were addressed shifted in the meantime (e.g., as in the case of negative
income tax proposals, Rossi & Lyall, 1974).

Unfortunately, a statement that evaluation techniques must respond
to the questions that are posed at different stages of a program's life
history only takes us part of the way. At the very least, it is
necessary to specify criteria that may be used to select appropriate
evaluation procedures, given one or more particular policy questions.
For example, randomized experiments are an extremely powerful method
for answering some of the questions posed as part of program design
issues, but may be largely irrelevant to or ineffective for answering
questions associated with program implementation issues. Yet, such
terms as "powerful", "relevant", and "ineffective" are hardly precise
and the following four criteria are far more instructive.

First, one must consider whether the measurement procedures that
are being proposed are likely to capture accurately what they are supposed
to measure. Sometimes such concerns with measurement quality are
considered under the rubric of "construct validity" (Cook & Campbell,
1979), and are germane to all empirical work regardless of the question
being asked. For example, while it is apparent that an examination of
the impact of Sesame Street on children's reading activity must rest
on measures that properly reflect what educators mean by an ability to
read, the same concerns are just as relevant in ethnographic accounts
of how parents "encourage" their children to watch Sesame Street. One
must, presumably, have a clear idea of the kinds of inducements parents
might provide and field work procedures that systematically and accurately
record the use of these inducements. At the very least, field workers
would have to be instructed about how to recognize an "inducement" as
distinct from other sorts of interaction occurring between parents and
children.

It is important to stress that questions about measurement quality
apply not only to program outcomes such as "learning", but also to
measures of the program (intervention) itself and to other factors that
may be at work (e.g., a child's motivation to learn). For example, the
crux of the ongoing debate about the Hawthorne Experiments undertaken
over 50 years ago involves a judgment of whether the "real" treatment
was a physical alteration in the worker's environment or changes in
worker-employer relations affecting employee motivation (Franke & Kaul,
1978; Franke, 1979; Wardwell, 1979).

Finally, while space limitations preclude a thorough discussion of
measurement issues in evaluation research, two generic kinds of
measurement errors should be distinguished. On one hand, measurement may be
subject to bias that reflects a systematic disparity between indicator(s)
and an underlying true attribute that is being gauged. Perhaps the most
visible example is found in the enormous literature on whether
standardized IQ tests really tap "general intelligence" in a culture free
manner. (For a recent review see Cronbach, 1975.) On the other hand,
measures may be flawed because of random error or "noise". Whether
approached as an "errors in variables" problem through the econometric
literature (e.g., Kmenta, 1971: 309-322) or as the "underadjustment"
problem in the evaluation literature (e.g., Campbell & Erlebacher, 1970),
random error can lead to decidedly non-random distortions in evaluation
results. (For a recent discussion see Barnow, Cain & Goldberger, 1980.)
The role of random measurement error is sometimes addressed through the
concept of "reliability".
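
The claim that purely random error can produce non-random distortion may
be illustrated with a small simulation. The sketch below is not taken from
the works cited above; it assumes a simple linear relation between a true
score and an outcome, normally distributed noise, and hypothetical names
(ols_slope, simulate_attenuation). Under those assumptions, the slope
estimated from an error-laden measure shrinks toward zero by roughly the
reliability of that measure.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(n=20000, true_slope=2.0, noise_sd=1.0, seed=0):
    """Compare slopes estimated from a true score and from a noisy measure.

    All settings are illustrative assumptions: the true score has unit
    variance, and the observed measure adds independent random error.
    """
    rng = random.Random(seed)
    true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [true_slope * x + rng.gauss(0.0, 1.0) for x in true_x]
    observed_x = [x + rng.gauss(0.0, noise_sd) for x in true_x]
    # Reliability: share of observed variance that is true-score variance.
    reliability = 1.0 / (1.0 + noise_sd ** 2)
    return ols_slope(true_x, y), ols_slope(observed_x, y), reliability
```

With the default settings the reliability is 0.5, so the slope estimated
from the noisy measure should come out near half of the slope estimated
from the true scores, even though the added error is entirely random.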

Secondly, many evaluation questions concern causal relations, as,
for example, whether or not a specific proposed program of bilingual
education will improve the English language reading achievement of
program participants, i.e., whether exposure to a program will "cause"
changes in reading achievement. Whenever a causal relationship is
proposed, alternative explanations must be addressed and presumably
discarded. If such alternatives are not considered, one may be led to
make "spurious" causal inferences; the causal relationship being proposed
may not in fact exist. Sometimes this concern with spurious causation
is addressed under the heading of "internal validity" (Cook & Campbell,
1979) and, as in the case of construct validity, is relevant regardless
of stage in a program's life history (assuming causal relationships are
at issue). For example, no one would dispute that a causal inference
that Head Start improves the school performance of young
children must also consider the alternative explanation that children
participating in a Head Start program were "simply" brighter than their
peers to begin with. Note that the same issues surface in ethnographic
accounts of how programs like Head Start function. A series of documented
observations suggesting that Head Start provides a more supportive
atmosphere in which students may learn academic skills requires that
alternative explanations be considered. Thus, it may be that the content
of Head Start programs is less important than the kinds of instructors
who volunteer (or are recruited) to teach in such programs.

The consideration of alternative causal explanations for the
working of social programs is an extremely important research design
consideration. For example, programs that deal with humans are all
subject more or less to problems of self-selection; often enough persons
who are most likely to be helped or who are already on the road to
recovery are those most likely to participate in a program. Thus,
vocational training offered to unemployed adults is likely to attract
those who would be most likely to improve their employment situation
in any event. Or sometimes program operators "cream the best" among
target populations to participate in programs, thereby assuring that such
programs appear to be successful. Or, in other cases, events unconnected
with the program produce improvements which appear to be the result of
the program: an improvement in employment for adults, for instance,
may make it more likely that young people will stay in and complete
their high school training.
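
The self-selection problem described above can be made concrete with a
simulation in which the program, by construction, has no effect at all.
The sketch is purely illustrative: the "motivation" variable, the
enrollment rule, and the function name simulate_selection are assumptions
introduced here, not features of any study cited in this paper.

```python
import random
import statistics


def simulate_selection(n=20000, seed=1):
    """Contrast a self-selected comparison with a randomized one.

    The simulated program has zero true effect. Outcomes depend only on
    an unobserved "motivation" score plus noise (assumed for illustration).
    """
    rng = random.Random(seed)
    enrolled, not_enrolled = [], []
    assigned, held_out = [], []
    for _ in range(n):
        motivation = rng.gauss(0.0, 1.0)
        outcome = motivation + rng.gauss(0.0, 1.0)
        # Self-selection: more motivated persons are more likely to enroll.
        if motivation + rng.gauss(0.0, 1.0) > 0.0:
            enrolled.append(outcome)
        else:
            not_enrolled.append(outcome)
        # Random assignment: a coin flip unrelated to motivation.
        if rng.random() < 0.5:
            assigned.append(outcome)
        else:
            held_out.append(outcome)
    naive_gap = statistics.fmean(enrolled) - statistics.fmean(not_enrolled)
    randomized_gap = statistics.fmean(assigned) - statistics.fmean(held_out)
    return naive_gap, randomized_gap
```

Under these assumptions the comparison of enrollees with non-enrollees
shows a sizable apparent "effect", while the comparison of randomly
formed groups hovers near zero, which is the correct answer here.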

It cannot be overemphasized that parallel design issues necessarily
surface in evaluations based on qualitative field work. While this point
has a long and rich history in the social sciences that routinely collect
and analyze qualitative data (e.g., Zelditch, 1962; Becker, 1958; Mensh
& Henry, 1953), evaluation researchers have to date been somewhat slow
to catch on. Too often "process research", for example, has become a 
license for research procedures that are little more than funded voyeurism. 
In short, there is more to field work than simply "hanging out". 

Third, whatever the empirical conclusions resulting from evaluation
research during any of the three program stages, it is necessary to
consider how broadly one can generalize the findings in question: that
is, are the findings relevant to other times, other subjects, similar
programs and other program sites? Sometimes such concerns are raised
under the rubric of "external validity" (Cook & Campbell, 1979), and
again, the issues are germane in all program stages and regardless of
evaluation method. Thus, even if a quantitative assessment of high
school driver education programs indicates that they do not reduce the
number of automobile accidents experienced by teenagers (Roberts, 1980),
it does not mean that adult driver education programs would be ineffective.
Similarly, an ethnographic account of why the driver education program
did not work for teenagers may or may not generalize to adult driver
education programs.

Generalization issues ordinarily arise around several types of
extensions of findings. For instance, are the findings applicable to
other cities, agencies, or school systems besides the ones in which they
were found? Or are the results specific to the organizations in which the
program was tested? Another issue that arises is whether a program's
results would be applicable to students who are different in abilities
or in socioeconomic background. For example, Sesame Street was found
to be effective with respect to preschool children from lower
socioeconomic families, but also more effective with children from middle
class families (Cook et al., 1975). Or, curricula that work well in
junior colleges may not be appropriate for students in senior colleges.
There is also the problem of generalizing over time. For example,
Maynard and Murnane (1979) found that transfer payments provided by the
Gary Income Maintenance Experiment apparently increased the reading scores
of children from the experimental families. One possible explanation
is that with income subsidies, parents (especially in single parent
families) were able to work less and therefore spend more time with their
children. Even if this is true, it raises the question of whether
similar effects would be found presently when inflation is taking a much
bigger bite out of the purchasing power of households. Finally, it is
impossible to introduce precisely the same treatment(s) when studies are
replicated or when programs move from the development to the demonstration
stage. Hence, one is always faced with trying to generalize across
treatments that can rarely be identical. In summary, external validity
surfaces as a function of the subjects of an evaluation, the setting,
the historical period and the treatment itself. Another way of phrasing
this issue is to consider that programs vary in their "robustness"; that
is, in their ability to produce the same results under varying
circumstances, with different operators, and at different historical times.
Clearly a "robust" program is highly desirable.

Finally, it is always important, whatever one's empirical
assessments, that the role of "chance" is properly taken
into account. When formal, quantitative findings are considered, this
is sometimes addressed under the heading of "statistical conclusion
validity" (Cook & Campbell, 1979) and the problem is whether tests
for "statistical significance" have been properly undertaken. For
example, perhaps Head Start children appear to perform better in early
grades, but at the same time, the observed differences in performance
could easily result from chance factors having nothing to do with the
program. Unless the role of these chance factors is formally assessed,
it is impossible to determine if the apparent program effects are real
or illusory. Similar issues appear in ethnographic work as well,
although formal assessments of the role of chance are difficult to
undertake in such studies. Nevertheless, it is important to ask whether
the reported findings rest on observed behavioral patterns that occurred
with sufficient frequency and stability to warrant the conclusion that
they are not "simply" the result of chance. No self-respecting
ethnographer would base an analysis of the role of parental inducements
in the impact of Sesame Street, for example, on a single parent-child
interaction on a particular morning.

Three types of factors play a role in producing apparent (chance)
effects that are not "real". The first reflects sampling error and
occurs whenever one is trying to make statements about some population
of interest from observations gathered on a subset of that population.
For example, one might actually be studying a sample of students from
the population attending a particular school, or a sample of teachers
from the population of teachers in a particular school system, or even
a sample of schools from a population of schools within a city, county,
or state. Yet, while it is typically more economical to work with samples,
the process of sampling necessarily introduces the prospect that any
conclusions based on the sample may well differ from conclusions that
might have been reached had the full population been studied instead.
Indeed, one could well imagine obtaining different results from different
subsets of the population.

While any subset that is selected from a larger population for
study purposes may be called a sample, some such subsets may be worse
than having no observations at all. The act of sampling must be
accomplished according to rational selection procedures that guard against
the introduction of selection bias. A class of such sampling procedures
that yield unbiased samples is called "probability samples", in which
every element in a population has a known chance of being selected
(Sudman, 1976; Kish, 1965). Probability samples are difficult to
execute and are often quite expensive, especially when dealing with
populations that are difficult to locate in space. Yet there are such clear
advantages to such samples, as opposed to haphazard and potentially
biased methods of selecting subjects, that probability samples are almost
always to be preferred over less rational methods.

Fortunately, when samples are drawn with probability procedures,
disparities between a sample and a given population can only result from
the "luck of the draw," and with the proper use of statistical inference,
the likely impact of these chance forces can be taken into account. Thus,
one can place "confidence intervals" around estimates from probability
samples, or ask whether a sample estimate differs in a statistically
significant manner from some assumed population value. In the case of
confidence intervals, one can obtain a formal assessment of how much
"wiggle" there is likely to be in one's sample estimates. In the case of
significance tests, one can reach a decision about whether a sample
statistic (e.g., a mean reading score) differs from some assumed value
in the population. For example, if the mean reading score from a random
sample of students differs from some national norm, one can determine if
the disparities represent statistically significant differences.
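
The two uses of statistical inference just described can be sketched in a
few lines. The example assumes a simple random sample large enough for a
normal approximation to hold; the function names
(mean_confidence_interval, differs_from_norm) and the reading scores are
hypothetical and serve only as illustration.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a sample mean.

    Assumes a simple random sample; the interval width is the "wiggle"
    in the estimate.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean - z * standard_error, mean + z * standard_error


def differs_from_norm(sample, norm, alpha=0.05):
    """Two-sided test of whether the sample mean differs from a norm.

    Rejecting at level alpha is equivalent to the norm falling outside
    the (1 - alpha) confidence interval computed above.
    """
    lower, upper = mean_confidence_interval(sample, 1.0 - alpha)
    return not (lower <= norm <= upper)


# Hypothetical reading scores from a random sample of 100 students.
reading_scores = [90, 95, 100, 105, 110] * 20
interval = mean_confidence_interval(reading_scores)
significant = differs_from_norm(reading_scores, norm=120)
```

The same interval answers both questions: its width conveys how much the
estimate might vary from sample to sample, and whether an assumed
national norm falls inside it determines the significance decision.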

A second kind of chance factor stems from the process by which
experimental subjects may be assigned to experimental and control groups.
For example, it may turn out that the assignment process yields an
experimental group that on the average contains brighter students than
the control group. As suggested earlier, this may confound any genuine
treatment effects with a priori differences between experimentals and
controls; here the impact of some positive treatment such as self-paced
instruction will be artifactually enhanced because the experimentals
were already performing better than the controls.

Much as in the case of random sampling, when the assignment is
undertaken with probability procedures, the role of chance factors can
be taken into account. In particular, it is possible to determine the
likelihood that outcome differences between experimentals and controls
are statistically significant. If the disparities are statistically
significant, chance (through the assignment process) is eliminated as an
explanation, and the evaluator can then begin making substantive sense
of the results. If the process by which some units get the treatment
and others do not is not a random process, one risks a "sample selection"
bias that cannot be assessed with statistical inference. It is also
possible to place confidence intervals around estimates of the treatment
effect(s), which are usually couched as differences between the means on
one or more outcome measures when the experimentals are compared to the
controls. Again, an estimate of the "wiggle" is produced; in this case
the "wiggle" refers to estimates of the experimental-control outcome
differences.
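
One way to assess chance arising from random assignment is a permutation
(randomization) test, which asks how often reshuffling the group labels
alone would yield a difference as large as the one observed. The paper
does not name this particular procedure; it is offered here as one
possible sketch, and the function name permutation_p_value and its
settings are assumptions made for illustration.

```python
import random
import statistics


def permutation_p_value(treated, control, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed experimental-control difference and the share
    of random relabelings that produce a difference at least as large.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # mimic a fresh random assignment
        diff = (statistics.fmean(pooled[:n_treated])
                - statistics.fmean(pooled[n_treated:]))
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (extreme + 1) / (n_permutations + 1)
```

A small p-value indicates that the assignment process alone would rarely
produce so large a gap. The test is only meaningful when assignment was
in fact random, which is the point made in the paragraph above.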

A third kind of chance factor has nothing to do with research design
interventions undertaken by the researcher (i.e., random sampling or
random assignment). Rather, it surfaces even if a given population of
interest is studied and no assignment process is undertaken. In brief,
one proceeds with the assumption that whatever the educational processes
at work, there will be forces that have no systematic impact on outcomes
of interest. Typically, these are viewed as a large number of small,
random perturbations that on the average cancel out. For example,
performance on a reading test may be affected by a child's mood, the
amount of sleep gotten on the previous night, the quality of the morning's
breakfast, a recent quarrel with a sibling, distractions in the room
where the test is taken, anxiety about the test's consequences and the
like. While these each introduce small amounts of variation in a child's
performance, their aggregate impact is taken to be zero on the average
(i.e., their expected value is zero). Yet since the aggregate impact is
only zero on the average, the performance of particular students on
particular days will be altered. Thus, there will be chance variation in
performance that needs to be taken into account. And as before one can
apply tests of statistical inference or confidence intervals. One can
still ask, for example, if some observed difference between experimentals
and controls is larger than might be expected from these chance factors
and/or estimate the "wiggle" in experimental-control disparities.

It is important to stress that statistical conclusion validity
speaks to the quality of inferential methods applied and not to whether
some result is statistically significant. Statistical conclusion validity
may be high or low independent of judgments about statistical significance.
(For a more thorough discussion of these and other issues of statistical
inference in evaluation research see Berk & Brewer, 1978.)

In summary, evaluation research involves a number of questions
linked to different stages in a program's life history. Appropriate
evaluation tools must be selected with such stages in mind and, in
addition, against the four criteria just discussed. In other words, at
each stage one or more policy relevant questions may be raised. Then,
evaluation procedures should be selected with an eye to their relative
strengths and weaknesses with respect to: measurement quality, an
ability to weigh alternative causal explanations, the prospects for
generalizing, and their capabilities for assessing the role of chance.

In the next few pages the general issues just raised will be
addressed in more depth. However, before proceeding, it is also
important to note that in the "real world" of evaluation research, even
when an ideal marriage is made between the evaluation questions being
posed and the empirical techniques to be employed, practical constraints
may intervene. That is, questions of cost, timeliness, political
feasibility and other difficulties may prevent the ideal from being realized.
This in turn will require the development of a "second best" evaluation
package (or even third best), more attuned to what is possible in
practice. On the other hand, practical constraints do not in any way
validate a dismissal of technical concerns; if anything, technical
concerns become even more salient when less desirable evaluation
procedures are employed.

POLICY ISSUES AND CORRESPONDING EVALUATION STRATEGIES

This section will consider each of the major policy and program
questions in turn and will identify the appropriate evaluation research
strategies that are best fitted to provide answers to each of the policy
questions.

Policy Formation and Program Design Issues

We first consider policy questions that arise in the policy
formation and program design stage. Policy changes presumably arise out
of dissatisfaction with existing policy or existing programs, or out
of the realization that a problem exists for which a new policy may
be an appropriate remedy. The information needed by policy
makers and administrators is that which would make the policy and
accompanying programs relevant to the problem as identified and
efficacious in providing at least relief from some of the problem's
burdens. It is important to stress that defining a "social problem" is
ultimately a political process whose outcomes are not simply an
assessment of available information. Thus, while it would be hard to argue
against providing the best possible data on potential areas of need,
there is no necessary correspondence between patterns in those data and
what eventually surfaces as a subject of concern. (See, for example,
Berk & Rossi, 1976, for a more thorough discussion.)

As indicated earlier, we do not mean to imply by the organization
of this section that policy makers always ask each of the questions
raised in the order shown. The questions are arranged from the more
general to the more specific, but that is an order we have imposed
and is not intended to be a description of typical sequences or even
a description of any sequence. Indeed, often enough, for example,
research that uncovers the extent and depth of a social problem may
spark the need for policy change, rather than vice versa, as may appear
to be implied in this section.

Where is the Problem and How Much? The Needs Assessment Question

These are questions that seek to understand the distribution and
extent of a given problem. Thus, it is one thing to recognize that
some children are learning at a rate that is too slow to allow them to
leave elementary schools sufficiently prepared for high school, and it
is quite another to know that this problem is more characteristic of
poor children and of minorities and more frequently encountered in
inner city schools. It does not take more than a few instances of
slow learning to document that a learning problem exists. To provide
sufficient information about the numbers of children who are in that
deprived condition and to identify specific school systems with heavy
concentrations of such children is quite another task. Similar questions
arise with respect to other conditions that constitute the recognized
social problems of our times, e.g., the distribution of quality medical
care, adequate housing, and so on.

There are numerous examples of needs assessments that might be
cited. Indeed, the monthly measurement of the labor force is perhaps
the most extensive effort at needs assessment, providing a monthly
estimate of unemployment and its distribution structurally and areally.
The Office of Economic Opportunity's 1968 Survey of Economic Opportunity
was designed to provide a finer grained assessment of the extent and
distribution of poverty in urban areas than was available through the
decennial Census. The Coleman et al. (1967) report of educational
opportunity was mandated by Congress to provide an assessment of how
educational services and facilities were distributed among the poor.

The number of local needs assessments covering single municipalities,
towns or counties done every year must now mount to the thousands. The
Community Mental Health legislation calls for such researches to be
undertaken periodically. Social impact statements to be prepared in
advance of large scale alterations in the environment often call for
estimates of the numbers of persons or households to be affected or
to be served. The quality of such local assessments varies widely and
is most likely on the average quite poor. The problem in attaining
high quality needs assessments lies in the fact that the measurement
of social problems of the more subtle variety (e.g., mental health)
is quite difficult and the surveying methods that need to be employed
are often beyond the reach of the talents and funds available.

It should be noted that the research effort involved in providing
answers to the needs assessment question can range from something as
inexpensive as copying relevant information from printed volumes of the
U.S. Census to several years' effort involving the design, fielding and
analysis of a large scale sample survey. Moreover, needs assessments do
not have to be undertaken solely with quantitative techniques.
Ethnographic research may also be instructive, especially in getting
detailed knowledge of the specific nature of the needs in question. For
example, the development of vocational training programs in secondary
schools should respond to an understanding of precisely what sorts of
job related skills are lacking in some target population. Perhaps the
real need has more to do with how one finds a job commensurate with
one's abilities than with an overall lack of skills per se (Liebow,
1964). On the other hand, when the time comes to assess the extent of
the problem, there is no substitute for formal quantitative procedures.
Stated a bit crudely, ethnographic procedures are likely to be
especially effective in determining the nature of the need. Quantitative
procedures are, however, essential when the extent of the need is
considered.

While needs assessment research is ordinarily undertaken for
the primary mission of developing accurate estimates of the amounts
and distribution of a given problem, and hence is intended to be
descriptive, often enough such research also can yield some understanding
of the processes involved in the generation of the problem in question.
For example, a search for information on how many high school students
study a non-English language may bring to light the fact that many
schools do not offer such courses and hence that part of the problem
is that universally available opportunities to learn foreign languages
may not exist. Or, the fact that many primary school children of low
socioeconomic backgrounds appear to be tired and listless in class may
be associated with a finding that few such children ate anything at all
for breakfast before coming to school. A program that provided
in-school breakfast feeding of poor children may be suggested by the
findings of this needs assessment.

Particularly important for uncovering process information of
this sort are carefully and sensitively conducted ethnographic studies.
Thus ethnographic studies of disciplinary problems within high schools
may be able to point out promising leads as to why some schools have
fewer disciplinary problems than others, in addition to providing some
indication of how widespread are problems associated with discipline.
The findings on why schools differ might serve to suggest useful ways
in which new programs could be designed that would help to bring all
schools into line with those that are currently better at handling
discipline issues.

Can We Do Anything About a Problem? Policy Oriented General Research

Knowing a lot about the distribution and extent of a problem
does not by itself lead automatically to programs that can help to
ameliorate that problem. In order to design programs we have to call
upon two sorts of knowledge. First of all, basic social science
understanding of a problem helps to point out the leverage points that
may be used to change the distribution and extent of a problem. Secondly,
we need to know something about the institutional arrangements that
are implicated in a problem so that workable policies and programs
can be designed. For example, our basic understanding of how students
learn might suggest that lengthening the school day would be an
effective way of increasing the rate of learning of certain skills.
However, in constructing a program, we would have to take into account
the fact that the lengthening of the school day is a matter that would
concern teachers and their organizations, as well as factors involving
the capacity of schools to do so and other persons involved, including
parents, school infrastructure personnel, etc.

Another example may help to illustrate how complex are the problems
that arise in the design of appropriate programs. To know that there
exist learning disabilities among school children by itself does not
suggest what would be an appropriate policy response. To construct a
policy response that has a chance to ameliorate educational problems
typically means that there are some valid theories about how such
problems arise and/or how such problems could be reduced. To pursue the
learning disabilities question further, an appropriate set of knowledge
useful to policy formation would be theories that link learning
disabilities to school experiences. Note that it is not crucial that
learning disabilities be created by school experiences but only that
school experiences influence to some appreciable degree the development
of learning disabilities. There is little that policy can do (at least
in the short run) about those "causes" of learning disabilities which
have their roots in factors that are traditionally thought to be outside
the sphere of policy relevance. Hence, knowledge about the role of
family relationships in learning disabilities is not policy relevant
(at present) because it concerns causes with which the policy sphere
has traditionally not concerned itself. In contrast, research and
knowledge dealing with the effects of schools, teachers, educational
facilities and the like are currently policy relevant because social
policy has been directed towards changing schools, the supply of
educational facilities, the quality of teachers and similar issues.

This conception of policy relevant research is one that causes
considerable misunderstanding concerning the relationships between
basic and applied social research. Policy oriented research is research
that tries to model how policy changes can affect the phenomenon in
question. Knowledge about the phenomenon per se--the province of basic
disciplinary concerns--may be important to understand how policy might
be changed to alter the course of the social problem in question, but
such basic research often does not provide that understanding. For
example, laboratory studies of learning processes or of the development
of aggression in persons may not be at all useful to educational policy
makers or to criminal justice officials. Perhaps the clearest way to put
the difference is that policy oriented and basic research are not
contradictory or in conflict with each other but that in addition to
understanding processes, policy oriented research must also be concerned
with the connections between the phenomenon and how policies and
programs may affect the phenomenon in question.

In addition, to construct a program that is likely to be adopted
by an organization, we need to have intimate knowledge of what motivates
such systems to change and adopt new procedures. Like other large scale
organizations, schools, factories, social agencies and the like are
resistant to change, especially when the changes do not involve
corresponding changes in reward systems. For example, an educational
program that is likely to work is one that provides positive incentives
for school systems and individual teachers to support and adopt the
changes in learning practices embodied in the program.

Inadequate attention to the organizational contexts of programs
is one of the more frequent sources of program implementation failure.
Mandating that a particular program be delivered by an agency that is
insufficiently motivated to do so, is poorly prepared to do so, and
has personnel that do not have the skills to do so is a sure recipe
for degraded and weakened interventions. Indeed, sometimes no programs
at all are delivered under such circumstances (Rossi, 1978).

Answers to the question "Can we do anything about the problem?"
can come from a variety of sources. Existing basic research efforts
(whatever the method) aimed at understanding general educational processes
are one source, although mastering this diverse technical literature
is often difficult. Commissioned review papers may be an easy way
to bring together in a pre-digested form the set of relevant existing
basic research findings.

It should be noted that basic research is often not useful to
policy needs because policy relevant concerns have not been directly
addressed in the research. For example, studies of children who are
disciplinary problems in school may stress understanding the links
between the family situations of the children and their behavior. But,
for policy and programmatic purposes, it would be considerably more
useful if the researchers had spent their time studying how disciplinary
systems within schools affect the rates at which disciplinary problems
appeared within schools. Policy mutable variables (those that can be
changed by policy) often tend to be slighted in basic research since
policy is ordinarily only a small contributor to the total causal system
that gives rise to a problem.

General research consciously linked to the role that schools and
the educational system generally play in learning and other behavior
may be the best answer to policy needs. Such research would pay special
attention to policy mutable conditions of individual behavior. Policy
relevant general research may take a variety of forms, ranging all the
way from systematic observational studies of school children to
carefully controlled randomized experiments that systematically vary the
policy relevant experiences of children. Without slighting basic
research support, it should be emphasized that fostering such policy
relevant general research needs special grant and contract research
programs with review personnel that are familiar with what is policy
relevant. It should be noted further that policy relevant general
research should be accomplished with greater care and with more careful
attention to problems of internal, external and construct validity
precisely because of the importance that such research may have in the
formation of policy and the design of programs.

Will Some Particular Program Work?

When some particular program has been identified that appears to
be sensible according to current basic knowledge in the field, then
the next step is to see whether it is effective enough to be worth
developing into a program. It is at this point that we recommend the
use of randomized controlled experiments in which the candidate programs
are tested. Such research is extremely powerful in detecting program
effects, because randomly allocating persons (or other units, e.g.,
classes) to an experimental group (to which the tested program is
administered) or to a control group (from whom the program is withheld)
assures that all the factors that ordinarily affect the educational
process in question are on the average distributed identically among
those who receive the program compared to those who do not. Therefore,
randomization on the average eliminates causal processes that may be
confounded with the educational intervention and hence enormously
enhances internal validity. That is, the problem of spurious
interpretations can be quite effectively addressed.
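
The logic of random allocation can be illustrated with a small
simulation. What follows is only a minimal sketch under stated
assumptions, not a procedure taken from any of the studies cited: the
pupils, the "background" score standing in for all the factors that
ordinarily affect learning, and the fixed program gain of five points
are all hypothetical, as are the function names.

```python
import random
from statistics import mean


def randomize(units, seed=0):
    """Randomly split units into an experimental and a control group."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def simulate(n_pupils=2000, true_gain=5.0, seed=1):
    """Simulate one experiment on hypothetical pupils.

    Each pupil has a background score (an assumed stand-in for all
    factors other than the program) that also drives the outcome.
    """
    rng = random.Random(seed)
    background = {pupil: rng.gauss(50.0, 10.0) for pupil in range(n_pupils)}

    experimental, control = randomize(background, seed=seed + 1)
    treated = set(experimental)

    # Outcome = background score, plus the program gain for treated pupils.
    outcome = {
        pupil: background[pupil] + (true_gain if pupil in treated else 0.0)
        for pupil in background
    }

    # Difference between groups on the background factor; randomization
    # makes this small on the average.
    balance_gap = mean(background[p] for p in experimental) - mean(
        background[p] for p in control
    )
    # Difference in mean outcomes: the estimated program effect.
    effect = mean(outcome[p] for p in experimental) - mean(
        outcome[p] for p in control
    )
    return experimental, control, balance_gap, effect


experimental, control, balance_gap, effect = simulate()
print(f"background gap between groups: {balance_gap:.2f}")
print(f"estimated program effect:      {effect:.2f}")
```

In this sketch the estimated effect differs from the assumed gain only
by the chance imbalance in background scores between the two groups,
which is the sense in which randomization removes confounding "on the
average."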

We advocate the use of randomized experiments at this stage in the
development of a program both because they are powerful, in the sense
used above, and also because a potentially useful program ought to
have the best chance of working when administered in a program that
is run by dedicated researchers. However, this commitment in no way
undermines the complementary potential of ethnographic studies,
particularly to document why a particular intervention succeeds or fails.

Developmental experiments should be conducted ordinarily on a
relatively modest scale and are most useful to policy needs when they
test a set of alternative programs that are intended to achieve the
same effects. Thus, it would be more useful for an experiment to test
several ways of ameliorating learning disabilities since the end result
would be to provide information on several possibly equally attractive
(a priori) methods of ameliorating that condition.

There are many good examples of the field testing through randomized
experiments of promising programs. The five income maintenance
experiments were devised to test under varying conditions the impact of
negative income tax plans as substitutes for existing welfare programs
(Kershaw & Fair, 1976; Rossi & Lyall, 1976). The Department of
Labor tested the extension of unemployment benefit coverage to prisoners
released from state prisons in a small randomized experiment conducted
in Baltimore (Lenihan, 1976). Randomized experiments have also been
used to test out national health insurance plans and direct cash subsidies
for housing to poor families. At issue in most of the randomized
experiments were whether the proposed programs would produce the effects
intended and whether undesirable side effects could be kept at a minimum.
Thus the Department of Labor LIFE experiment (Lenihan, 1976) was designed
to see whether released felons would be aided to adjust to civilian
life through increased employment and lowered arrest rates.

Can Agencies Deliver an Effective Program? Field Testing the Program

Once an effective treatment has been isolated, the next question
that can be raised is whether a program incorporating the treatment
can be administered through government agencies. Implementation of
programs is always somewhat an open question. Agencies are no different
from other organizations in resisting changes that are unfamiliar and
perhaps threatening. Interventions that work well with dedicated researchers
administering them often fail when left to less skillful and less dedicated
persons, as one might find in federal, state or local agencies. Hence,
it is necessary to test whether agencies can deliver interventions at
the proper dosage levels and without significant distortions. Randomized
controlled experiments, as described above, are again an extremely
powerful tool, and appropriately designed randomized experiments should
compare several modes of delivery.

Such field testing has been undertaken in a systematic way in a
number of human services areas. For example, the Department of Housing
and Urban Development commissioned ten cities to carry out demonstrations
of housing allowance programs in order to test how best to administer
such programs, leaving it to each city housing agency to set up its
housing allowance program within the broad limits of specified payment
levels and rules for client eligibility. Following up on the LIFE
experiment noted above, the TARP experiments funded by the Department of
Labor provide another example: Two states were chosen to run a program
which provided eligibility for unemployment benefits to persons released
from those states' prisons. Each state ran the program as a randomized
experiment with payments provided through the Employment Security Agency
of each state (Rossi, Berk, & Lenihan, 1980).

A special role at this stage can be played by process research
activities which employ specially sensitive and observant researchers
in close contact with the field testing sites. Observations collected
by such researchers may be extremely useful in understanding the
specific concrete processes that either impede or facilitate the working
of the program. Again, ethnographic accounts can be extremely instructive
in addressing the "whys."


II. Accountability Evaluation

Once a program has been enacted and is functioning, one of the
main questions that is asked concerns whether or not the program is
in place appropriately. Here the issues are not so much whether the
program is achieving its intended effects, but whether the program
is simply running in ways that are appropriate and whether problems
have arisen in the field to which answers need to be given. Programs
often have to be fine-tuned in the first few years or so of operation.

Is the Program Reaching the Appropriate Schools and Children?

Achieving appropriate coverage of beneficiaries is often
problematic. Sometimes a program may simply be designed inadvertently so
as to be unable to reach and serve significant portions of the total
intended beneficiary population. For example, a program designed to
provide food subsidies to children who spend their days in child care
facilities may fail to reach a large proportion of such children if
regulations exclude child care facilities that are serving fewer than
five children. A very large proportion of children who are cared for
during the day outside their own households are cared for by women
who take a few children into their homes (Abt Associates, 1979).

Experience with social programs over the past two decades has
shown that there are few, if any, programs that achieve full coverage
or near full coverage of intended beneficiaries, especially where
coverage depends on positive actions on the part of beneficiaries. Thus
not all persons who are eligible for Social Security payments actually
apply for them; estimates range up to 15% of all eligible beneficiaries.
Still others may not be reached because facilities for delivering the
services involved are not accessible to them. And so on.

Although a thorough needs assessment of child care problems would
have brought to light the fact that so large a proportion of child
care was furnished by small scale vendors, and hence should have been
taken into account in drawing up administrative regulations, such
might not have been the case. In addition, patterns of the problem
might change over time, sometimes in response to the existence of a
program. Hence there is some need to review from time to time how many
of the intended beneficiaries are being covered by a program.

There is also another side to the coverage problem. Programs may
cover and extend benefits to persons or organizations that were not
intended to be served. Such unwanted coverage may be impossible to
avoid because of the ways in which the program is delivered. For
example, while Sesame Street was designed primarily to reach
disadvantaged children, it also turned out to be attractive to advantaged
children and many adults. There is no way that one can keep people from
viewing a television program once broadcast (nor is it entirely
desirable to do so in this case) and hence a successful TV program
designed to reach some children may reach them but also many others
(Cook et al., 1975).

While the unwanted viewers of Sesame Street are reached at no
additional costs, there are times when the "unwanted" coverage may turn
out to severely drain program resources. For example, while Congress
may have desired to provide educational experiences to returning veterans
through the GI Bill and its successors, it was not clear whether Congress
had in mind the subsidization of the many new proprietary educational
enterprises that came into being primarily to supply "vocational"
education to eligible veterans. Or, in the case of the bilingual
education program, many primarily English speaking children were found
to be program beneficiaries.

Studies designed to measure coverage are similar in principle to
those discussed under "Needs Assessment" studies earlier. In addition,
overcoverage may be studied as a problem through program administrative
records. Undercoverage, however, often involves commissioning special
surveys.
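
The two sides of the coverage problem can be expressed as simple
proportions. The sketch below is one possible formalization, offered
only as an illustration: the function name, the definitions chosen, and
the counts in the usage example are assumptions, not figures or methods
drawn from the studies cited. It assumes that the number of eligible
persons comes from a survey and that the counts of persons served come
from administrative records.

```python
def coverage_summary(eligible_total, served_eligible, served_ineligible):
    """Summarize coverage from three counts (all hypothetical inputs).

    eligible_total    -- intended beneficiaries in the population (survey)
    served_eligible   -- intended beneficiaries actually served (records)
    served_ineligible -- persons served who were not intended (records)
    """
    if eligible_total <= 0:
        raise ValueError("eligible_total must be positive")
    if not 0 <= served_eligible <= eligible_total:
        raise ValueError("served_eligible must lie between 0 and eligible_total")
    if served_ineligible < 0:
        raise ValueError("served_ineligible cannot be negative")

    served_total = served_eligible + served_ineligible
    coverage = served_eligible / eligible_total
    return {
        # Share of intended beneficiaries who are reached.
        "coverage": coverage,
        # Share of intended beneficiaries who are missed.
        "undercoverage": 1.0 - coverage,
        # Share of all persons served who were not intended to be served.
        "overcoverage": served_ineligible / served_total if served_total else 0.0,
    }


# Hypothetical counts, for illustration only.
summary = coverage_summary(
    eligible_total=1000, served_eligible=850, served_ineligible=150
)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Note that the denominator for undercoverage requires knowing how many
intended beneficiaries exist, which is why it ordinarily calls for a
survey, while overcoverage can be computed from records of persons
served alone.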

Are Appropriate Benefits Being Delivered? Program Integrity Research

When program services depend heavily on the ability of many agencies
to recruit and train appropriate personnel or to retrain existing personnel
or to undertake significant changes in standard operating procedures, it
is sometimes problematic whether a program will always manage to deliver
to beneficiaries what had been intended. For many reasons the issue of
program integrity often becomes a critical one that may require additional
fine-tuning of basic legislation, or of administrative regulations.

Several examples may highlight the importance of this issue for
educational programs. While funds may be provided for school systems
to upgrade their audio-visual equipment, and schools may purchase it,
it is often the case that such equipment goes unused either because
there are no persons trained to use the equipment or because
audio-visual materials are not available (Rossi & Biddle, 1966). Or a
new curriculum may be designed and made available to schools, but few
schools are able to use the curriculum because teachers find the
curriculum too difficult to use.

In other cases, the right services are being delivered but at a
level that is too low to make a significant impact on beneficiaries.
Thus a supplementary reading instruction program that means an additional
forty minutes per week of reading instruction may not be delivered at
sufficient strength and quantity to make any difference in reading
progress.

Evaluation research designed to measure what is being delivered
may be designed easily or may involve measurement problems of
considerable complexity. Thus it may be very easy to learn from schools how
many hours per week their audio-visual equipment is used, but very
difficult to learn what is precisely going on inside a classroom when
teachers attempt to use a new teaching method, where the program implies
changes in teaching methods, classroom organization or other services
that are highly dependent on persons for delivery. Measurement that
would require direct observation of classroom activity may turn out to
be very expensive to implement on a large scale.

Often for purposes of fine-tuning a program, it may not be
necessary to proceed on a mass scale in doing research. Thus, it may not
matter whether a particular problem in implementing a program occurs
frequently or infrequently, since if it occurs at all it is not desirable.
Hence for program fine-tuning small scale qualitative observational studies
may be most fruitful.

Programs that depend heavily on personnel for delivery and/or
which involve complicated programs and/or which call for individualized
treatments for beneficiaries are especially good candidates for careful
and sensitive fine-tuning research. Each of the characteristics enumerated
in the previous sentence is one that creates difficulties in
appropriate implementation. In effect, this statement means that
personalized human services that are complicated are problematic in
motivating personnel to deliver the services appropriately and skillfully.

Are Program Funds Being Used Appropriately? Fiscal Accountability

The accounting profession has been in operation considerably
longer than has program evaluation. Hence procedures for determining
whether or not program funds have been used responsibly and as intended
are well established and hence are not problematic. However, it should
be borne in mind that fiscal accountability measurements cannot
substitute for the studies mentioned above. The fact that funds appear
to be used as intended in an accounting sense may not mean that program
services are being delivered as intended, in the sense discussed above.
The conventional accounting categories used in a fiscal audit are
ordinarily sufficient to detect, say, fraudulent expenditure patterns,
but may be insufficiently sensitive to detect whether services are
being delivered at the requisite level of substantive integrity.

It is also important to keep in mind that the definition of costs
under accounting principles differs from the definition of costs used
by economists. For accountants, a cost reflects conventional bookkeeping
entries such as out-of-pocket expenses, historical costs (i.e., what
the purchase price of some item was), depreciation and the like. Basically,
accountants focus on the value of current stocks of capital goods and
inventories of products coupled with "cash flow" concerns. When the
question is whether program funds are being appropriately spent, the
accountant's definition will suffice. However, economists stress
opportunity costs defined in terms of what is given up when resources
are allocated to particular purposes. More specifically, opportunity
costs reflect the next best use to which the resources could be put.
For example, the opportunity cost of raising teachers' salaries by 10%
may be the necessity of foregoing the purchase of a new set of textbooks.
While opportunity costs may not be especially important from a
cost-accounting point of view, opportunity costs become critical when
cost-effectiveness or benefit-cost analyses of programs are undertaken.
We will have more to say about these issues later.

III. Program Assessment Evaluation

The evaluation tasks discussed under accountability studies are
directed mainly to questions dealing with how well a program is running.
Whether or not a program is effective is a different issue, to which
answers are not easily provided. Essentially, the question asks whether
or not a program achieves its goals over and above what would be expected
without the program.

Many evaluators consider that the effectiveness question is
quintessentially evaluation. Indeed, there is some justification for
that position since effectiveness assessment is certainly more difficult
to accomplish, requiring higher levels of skills and ingenuity than
any of the previously discussed evaluation activities. However, there
is no justification for interpreting every evaluation task as calling
for effectiveness assessments, as apparently some evaluators have done
in the past, aided in their misinterpretation by imprecise requests
for help from policy makers and administrators.

Can Effectiveness of a Program be Estimated? The Evaluability Question

A program that has gone through the stages described earlier in
this chapter should provide few obstacles to evaluation for
effectiveness in accomplishing its goals. But there are many human services
programs that present problems for effectiveness studies because one
or more of several criteria for evaluation are absent. Perhaps the
most important criterion, one which is frequently unmet, is the
existence of well formulated goals or objectives for the program. For
example, a program that is designed to raise the level of learning
among certain groups of school children through the provision of per
capita payments to schools for the purpose is not evaluable for its
effectiveness without considerable further specification of goals.
Raising the level of learning as a goal has to be specified further to
indicate what is meant by "levels" and the kinds of learning
achievements that are deemed relevant.

A second criterion is that the program in question be well specified.
Thus a program that is designed to make social work agencies be more
effective by encouraging innovations is also not evaluable as far as
effectiveness is concerned. First, the goals are not very well specified,
but neither are the means for reaching goals. Innovation as a means
of reaching a goal is not a method, but a way of proceeding. Anything
new is an innovation and hence such a program may be encouraging the
temporary adoption of a wide variety of specific techniques and is
likely to vary widely from site to site.

Finally, a program is evaluable from an effectiveness point of
view only if it is possible to estimate in some way what is the expected
state of beneficiaries in the absence of the program. As we will
discuss below, the critical hurdle in effectiveness studies is to develop
comparisons between beneficiaries that experience a program and those
who have not had such experiences. Hence a program that is universal
in its coverage and that has been going on for some period of time cannot
be evaluated for effectiveness. For example, we cannot evaluate the
effectiveness of the public school systems in the United States, because
it is not possible to make observations on American cities, towns,
counties and states that do not have (or recently have not had) public
school systems.

Finally, effectiveness evaluations are the most difficult evaluation
tasks undertaken by evaluators, requiring the most highly trained
personnel for their undertaking, and considerable sums of money for data
collection and analysis. Few evaluation units have the expertise and
experience to design and/or carry out effectiveness evaluations.
Especially rare are such capabilities on the state and local levels.

This discussion of effectiveness evaluability is raised here
because we believe that often evaluators are asked to undertake tasks
that are impossible or close to impossible. Thus it is not sensible
for policy makers or program managers to call for effectiveness
evaluation to be undertaken by state and local evaluation units, at
least at this stage in the development of state and local capabilities.
Nor does it make much sense to undertake large scale evaluations of
programs that have no nation-wide uniform goals but are locally defined.
Hence the evaluation of Title I or of Head Start and similar programs
should not be undertaken or called for lightly, if at all.

Techniques have been developed (Wholey, 1977) to determine whether
or not a program is evaluable in the senses discussed above. Congress
and other decision makers may want to commission such studies as a
first step rather than to assume that all programs can be evaluated.

Finally, it may be worth mentioning in passing that questions of
evaluability have in the past been used to justify "goal-free" evaluation
methods (e.g., Scriven, 1972; Deutscher, 1977). The goal-free advocates
have contended that since many of a program's aims evolve over time,
the "hypothetico-deductive" approach to impact assessment (Heilman, 1980)
is at best incomplete and at worst misleading. In our view, impact
assessment necessarily requires some set of program goals, although
whether they are stated in advance and/or evolve over time does have
important implications for one's research procedures. In particular,
evolving goals require far more flexible research designs (and researchers).
In other words, there cannot be such a thing as a "goal-free" impact
assessment. At the same time, we have stressed above that there are
other important dimensions to the evaluation enterprise in which goals
are far less central. For example, a sensitive monitoring of program
activities can proceed productively without any consideration of ultimate
goals. Thus, goal-free evaluation approaches can be extremely useful as
long as the questions they can address are clearly understood.

Did the Program Work? The Effectiveness Question

As discussed above, any assessment of whether or not a program
"worked" necessarily assumes that it is known what the program was
supposed to accomplish. For a variety of reasons, enabling legislation
establishing programs may appear to set relatively vague goals or
objectives for the program, and it is necessary during the "design phase"
(as discussed above) to develop specific goals. Goals for such general
programs may be developed by program administrators through consideration
of social science theory, past research and/or studies of the problem
that the program is supposed to ameliorate. Thus Title I was designed
to enrich the educational experiences of disadvantaged children through
providing special funds to state and local school systems that have
relatively large proportions of disadvantaged children on their rolls.
However, in order to accomplish this general (and too general) objective,
it was necessary in local school systems to develop specific programs
with their own goals. Thus some goals or sets of objectives may be
developed as a program goes along (Chen & Rossi, 1980).

However goals may be established, the important point is that
it is not possible to determine whether a program worked without
developing a limited and specific set of criteria for establishing the
condition of "having worked." For example, it would not have been
possible to develop an assessment of whether Sesame Street "worked"
without having decided that its goals were to foster reading and number
handling skills. Whether these goals existed before the program was
designed or whether they emerged after the program was in operation is
less important for our purposes than the fact that such goals existed.

Programs rarely succeed or fail in absolute terms. Success or
failure is always relative to some benchmark. Hence an answer to "Did
the program work?" requires a consideration of "Compared to what?"

The development of appropriate comparisons can proceed along at
least three dimensions: comparisons across different subjects, comparisons
across different settings and comparisons across different times. In
the first instance, one might compare the performance of two sets of
students in a given class in a given classroom period. In the second
instance, one might compare the performance of the same set of students
in two different classroom settings (necessarily at two different points
in time). In the third instance, one might compare the same students
in the same classroom, but at different points in time.

As Figure 1.1 indicates, it is also possible to mix and match these
three fundamental dimensions to develop a wide variety of comparison
groups. For example, comparison group 2 (C2)* varies both the subjects

*We have used the term "comparison group" as a general term to be
distinguished from the term "control group." Control groups are
comparison groups that have been constructed by random assignment.



Figure 1.1
**Boxes that, while logically possible, lead to comparison groups
which make no sense substantively in this context.



and the setting although the time is the same. Or, comparison group 6
(C6) varies subjects, the setting and the time. However, with each
added dimension by which one or more comparison groups differ from the
experimental group, the number of threats to internal validity necessarily
increases. For example, the use of comparison group 4 (different setting
and different time period) requires that assessment of program impact
simultaneously take into account possible confounding factors associated
with such things as differences in student background and motivation and
such things as the "reactive" potential of different classroom
environments. This in turn requires either an extensive data collection
effort to obtain measures on these confounding factors coupled with the
application of appropriate statistical adjustments (e.g., multiple
regression analysis), or the use of randomization and thus, true control
groups. Randomization, of course, will on the average eliminate confounding
influences in the analysis of impact. On grounds of analytic simplicity
alone, it is easy to see why so many expositions of impact assessment
strongly favor research designs based on random assignment. In addition,
it cannot be overemphasized that appropriate statistical adjustments
(in the absence of randomization) through multivariate statistical
techniques require a number of assumptions that are almost impossible
to fully meet in practice.* For example, it is essential that measures
of all confounding influences be included in a formal model of the
program's impact, that their mathematical relationship to the outcome
be properly specified (e.g., a linear additive form versus a
multiplicative form), and that the confounding influences be measured
without error. Should any of these requirements be violated, one risks
serious bias in any estimates of program impact.

At the same time, however, random assignment is often impractical
or even impossible. And even when random assignment is feasible, its
advantages rest on randomly assigning a relatively large number of
subjects. To randomly assign only two schools to the experimental group

*There are some research designs which, while not based on random
assignment, do readily allow for unbiased estimates of treatment effects
through multivariate statistical adjustments. See, for example, Barnow,
Cain and Goldberger (1980).
and only two schools to the control group, for example, will not allow
on the average equivalences between experimentals and controls to
materialize. Consequently, one is often forced to attempt statistical
adjustments for initial differences between experimental and comparison
subjects.
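
The point about assigning only two schools to each group can be sketched
numerically. The following is a minimal illustration, not a result from any
study: the four school names and their baseline scores are invented, and the
helper `baseline_gaps` is a hypothetical name. It lists every way of
assigning two of four schools to the experimental group and shows the
baseline difference that each assignment would leave behind.

```python
from itertools import combinations
from statistics import mean

# Hypothetical baseline reading scores for four schools (invented values).
baseline = {"School A": 42.0, "School B": 48.0, "School C": 55.0, "School D": 63.0}


def baseline_gaps(scores, n_treated):
    """Return (treated schools, treated-minus-control baseline gap)
    for every possible assignment of n_treated schools to treatment."""
    schools = sorted(scores)
    gaps = []
    for treated in combinations(schools, n_treated):
        control = [s for s in schools if s not in treated]
        gap = mean(scores[s] for s in treated) - mean(scores[s] for s in control)
        gaps.append((treated, gap))
    return gaps


gaps = baseline_gaps(baseline, 2)
for treated, gap in gaps:
    print(treated, gap)

# Averaged over all possible assignments, the gap balances out ...
print("mean gap over all assignments:", mean(g for _, g in gaps))
# ... but each single assignment still starts with unequal groups.
print("smallest absolute gap:", min(abs(g) for _, g in gaps))
print("largest absolute gap:", max(abs(g) for _, g in gaps))
```

With so few units, equivalence holds only across the whole set of possible
assignments, never within the one assignment actually drawn, which is why
statistical adjustment for initial differences is still needed.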

The use of multivariate statistical adjustments raises a host of
questions that cannot be addressed in detail here. Suffice to say
that despite the views of some that anything that can go wrong will
go wrong, extensive practical experience suggests a more optimistic
conclusion. Quite often, useful and reasonably accurate estimates of
program effects can be obtained despite modest violations of the required
statistical assumptions. Moreover, available statistical technology
is evolving rapidly and many earlier problems now have feasible solutions,
at least in principle. (For a review of some recent statistical
developments in the context of criminal justice evaluation, see Berk, 1980.)

To illustrate the usefulness of assessments not resting on random
assignment, consider a recent evaluation (Robertson, 1980) of the
effectiveness of driver education programs in reducing accidents among
16 to 18 year olds. The evaluator took advantage of the fact that the
Connecticut legislature decided not to subsidize such programs within
local school systems. In response to this move, some school districts
dropped driver education out of their high school curriculum and some
retained it. Two sets of comparisons were possible: accident rates
for persons of the appropriate age range in the districts that dropped
the program were computed before and after the program was dropped, and
accident rates for the same age groups in the districts that retained
driver education were compared to the accident rates in districts that
dropped the driver education program. It was found that the accident rates
significantly dropped in those districts that dropped the program, a
finding that led to the interpretation that the program increased
accidents because young people were led to obtain licenses earlier
than otherwise.
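
The two sets of comparisons just described can be written out as simple
arithmetic. The sketch below uses invented accident rates (they are not
Robertson's figures) and hypothetical function names. The final quantity,
which subtracts the change in the retaining districts from the change in the
dropping districts, is an added step that the description above does not
itself report.

```python
# Hypothetical accident rates per 1,000 persons aged 16 to 18.
# Invented for illustration only; not data from the study cited above.
rates = {
    "dropped": {"before": 80.0, "after": 60.0},
    "retained": {"before": 82.0, "after": 81.0},
}


def before_after_change(group):
    """Comparison across time: change in rate within one set of districts."""
    return rates[group]["after"] - rates[group]["before"]


def cross_district_gap(period):
    """Comparison across settings: retained minus dropped, in one period."""
    return rates["retained"][period] - rates["dropped"][period]


change_dropped = before_after_change("dropped")
gap_after = cross_district_gap("after")

# Added step (an assumption of this sketch): net out whatever change also
# occurred in the districts that kept the program.
net_change = before_after_change("dropped") - before_after_change("retained")

print("change in dropping districts:", change_dropped)
print("retained minus dropped, after:", gap_after)
print("change net of retaining districts:", net_change)
```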

It is sometimes possible to either enhance or partially bypass
comparison group problems by resorting to some set of external criteria
as a baseline. For example, it is common in studies of desegregation
or affirmative action programs to apply various measures of equity as
a "comparison group" (Baldus & Cole, 1977). Thus, an assessment of
whether schools in black neighborhoods are being funded at comparable
levels to schools in white neighborhoods might apply the criterion
that disparities in excess of plus or minus &X in per pupil expenditures
indicate inequity and hence failure (Berk & Hartman, 1972). However,
the use of such external baselines by themselves still leaves open the
question of causal inference. It may be difficult to determine if the
program or some other set of factors produced the observed relationship
between outcomes of interest and the external metric.
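
An external criterion of this kind amounts to a tolerance band around
parity. The sketch below is a minimal illustration under stated assumptions:
the function name, the expenditure figures, and the 10 percent tolerance are
all placeholders chosen for the example and are not the values used in the
studies cited.

```python
def exceeds_equity_band(spend_a, spend_b, tolerance):
    """Return True when per pupil spending in group A differs from group B
    by more than plus or minus `tolerance`, expressed as a fraction of B."""
    relative_gap = (spend_a - spend_b) / spend_b
    return abs(relative_gap) > tolerance


# Hypothetical per pupil expenditures and a placeholder tolerance of 10%.
print(exceeds_equity_band(850.0, 1000.0, tolerance=0.10))  # gap of -15%
print(exceeds_equity_band(970.0, 1000.0, tolerance=0.10))  # gap of -3%
```

The check flags a disparity but, as noted above, says nothing about what
produced it.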

It is also important to understand that distinguishing between
success and failure is not a clearcut decision since there are usually
degrees of success or degrees of failure. While decision makers may
have to make binary decisions, for example whether to fund or not to
fund, the evidence provided on effectiveness usually consists of
statements of degree which then have to be translated into binary terms by
the decision makers. Thus it may turn out that a program that succeeds
in raising the average level of reading by half a year more than one
would ordinarily expect is less successful than one which has
effectiveness estimates of a full year. This quantitative difference has
to be translated into a qualitative difference when the decision to fund
one rather than the other program comes into question.

In short, the construction of effectiveness evaluation studies is
a task that requires a considerable amount of skill. Hence such
effectiveness studies should be called for only when there is sufficient
reason to believe that the circumstances warrant such studies, as
mentioned earlier in this chapter, and depending on whether or not the
capability is available in the unit responsible for the study.




Was the Program Worth It? The Economic Efficiency Question

Given a program of proven effectiveness, the next question one
might reasonably raise is whether the opportunity costs of the program
are justified by the gains achieved.* Or the same question might be
more narrowly raised in a comparative framework: is Program A more
"efficient" than Program B, as alternative ways of achieving some
particular goal?

The main problem in answering such questions centers around
establishing a yardstick for such an assessment. For example, would
it be useful to think in terms of dollars spent for units of achievement
gained, in terms of students covered, or in terms of classes or
schools that come under the program?

The simplest way of answering efficiency issues is to calculate
cost effectiveness measures, dollars spent per unit of output. Thus
in the case of the Sesame Street program, several cost effectiveness
measures were computed:

--Dollars spent per child hour of viewing;
--Dollars spent per additional letter of the alphabet learned.
Note that the second measure implies knowing the effectiveness of the
program, as established by an effectiveness evaluation.
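
Both measures are ratios of cost to a unit of output. The sketch below shows
the arithmetic only; the cost and output totals are invented and do not come
from the Sesame Street evaluation, and `cost_per_unit` is a hypothetical
helper name.

```python
def cost_per_unit(total_cost, units_of_output):
    """Cost effectiveness measure: dollars spent per unit of output."""
    if units_of_output <= 0:
        raise ValueError("units_of_output must be positive")
    return total_cost / units_of_output


# Hypothetical totals, invented for illustration.
program_cost = 8_000_000.0               # dollars spent on the program
child_viewing_hours = 40_000_000.0       # output that needs no impact estimate
additional_letters_learned = 2_000_000.0 # output that requires an impact estimate

dollars_per_viewing_hour = cost_per_unit(program_cost, child_viewing_hours)
dollars_per_letter = cost_per_unit(program_cost, additional_letters_learned)

print("dollars per child hour of viewing:", dollars_per_viewing_hour)
print("dollars per additional letter learned:", dollars_per_letter)
```

The second denominator is available only after an effectiveness evaluation
has estimated how many additional letters were learned because of the
program.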

The most complicated mode of answering the efficiency question is
to conduct a full-fledged cost-benefit analysis in which all the costs
and benefits are computed. Relatively few full-fledged cost-benefit
analyses have been made of social programs because it is difficult
to put all the costs and all the benefits into the same yardstick terms.
In principle, it is possible to convert into dollars all the costs and
benefits of a program. In practice, it is rarely possible to do so
without some disagreement on the valuation placed, say, on learning an
additional letter of the alphabet.

An additional problem with full-fledged benefit-cost analyses is 
that they must consider the long run consequences not only of the 
program, but the long run consequences of the next best alternative

*Recall that opportunity costs address the foregone benefits of the
next best use of the resources in question (Thompson, 1980:65-74).
foregone. This immediately raises the question of "discounting," the
fact that resources invested today in some social program may produce
consequences over a large number of succeeding years that have to be
compared to the consequences from the next best alternative over a large
number of succeeding years. For example, a vocational program in inner
city high schools needs to address (among other things) the long run
impact on students' earnings over their lifetimes. This in turn requires
that the costs and benefits of the program and the next best alternative
be phrased in terms of today's dollars. Without going into the arcane
art of discounting, the problem is to figure out what a reasonable rate
of return over the long run for current program investments and competing
alternatives might be. And one can obtain widely varying assessments
depending on what rate of return is used (Thompson, 1980).
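
How strongly the assessment depends on the rate chosen can be seen in a
short calculation. The sketch below is an illustration under assumptions,
not a method taken from the source cited: it supposes a constant earnings
gain of $1,000 a year for 40 years (invented figures) and applies the
standard present value formula, in which a dollar received t years from now
is worth 1/(1 + r)^t dollars today at discount rate r.

```python
def present_value(annual_amount, years, rate):
    """Present value of a constant amount received at the end of each year
    for `years` years, discounted at `rate` per year."""
    return sum(annual_amount / (1.0 + rate) ** t for t in range(1, years + 1))


# Hypothetical benefit stream: $1,000 of added earnings a year for 40 years.
for rate in (0.02, 0.05, 0.10):
    value = present_value(1000.0, 40, rate)
    print(f"rate {rate:.0%}: present value about ${value:,.0f}")
```

The same stream of future earnings is worth far less in today's dollars at
the higher rates, so two analysts who agree on every cost and benefit can
still reach different conclusions if they discount differently.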

Evaluation in Evolution

The field of evaluation research is scarcely out of its infancy as
a social scientific field of inquiry. The first large scale field
experiments were started in the middle 60s. Concern for large scale
national evaluations of programs also had its origins in the War on
Poverty. The art of designing large scale implementation and monitoring
studies is just now evolving. Concern with the validity status of
qualitative research has just begun. And so on.

Perhaps what is most important as a developing theme is the
importance of social science theory for evaluation. It has become
increasingly obvious that social policy is almost a blind thrashing about
for solutions. Guiding the formation of social policy through sensitive
and innovative applications of general social science theory and empirical
knowledge is beginning to occur more and more. This development is
further enhanced by the increasingly held realization that errors in model
specification are errors in theory. Hence there is no good policy without
good understanding of the problem involved and of the role that policy
can play. Nor is there any good evaluation without theoretical guidance
in modelling policy effects.
USES AND USERS OF EVALUATION

* * * 

Marvin C. Alkin
University of California, Los Angeles

Introduction

Social scientists like to attribute rationality to the various
activities conducted in social systems. Evaluation is one of those
presumably rational activities. Indeed, the presumption holds that
evaluation's rationality is attributed to its purposiveness: it serves a
useful purpose. It may be safely said that all evaluations serve a purpose
perceived to be useful to someone. An evaluation might be conducted simply
to satisfy legislative requirements, or to satisfy the illusion that the
organization is engaged in systematic self-evaluation. In some quarters,
each of these is considered to be a useful purpose, or else they would
most certainly not have been initiated.

Four types of situations have been outlined (Alkin, 1975) which do
not aim evaluation at decision making: (1) window-dressing, (2) legal
requirements, (3) public relations, and (4) professional prestige. Window
dressing evaluations seek justification of decisions already made.
Evaluations commissioned simply to comply with legal requirements are
often deliberately made solely a pro forma exercise. Some evaluations are
commissioned simply as public relations gestures where the intent is to
demonstrate the objectivity of decision makers. Professional prestige
evaluations are designed to enhance individual reputations of the report
commissioners as innovators.

However, it is reasonable to argue that judgment of evaluation
rests not only on its serving a generally useful purpose, but on the
extent to which it actually informs decision making, the most highly
desired of potentially useful purposes. This point is well articulated by
Weiss:

The basic rationale for evaluation is that it provides
information for action. Its primary justification is that
it contributes to the rationalization of decision making.
Although it can serve such other functions as knowledge-building
and theory-testing, unless it gains serious hearing
when program decisions are made, it fails in its major purpose.

(1972, p. 318)



Unfortunately, there is widespread disagreement as to when evaluation
has provided "information for action," or more broadly, what
constitutes use. Furthermore, it is not clear who may be considered
"the users." In the remainder of this chapter, these two major issues will
be addressed with respect to national evaluations in the Education
Department: who are the users? and what constitutes use?

The Users 

For evaluations to be used, there must be someone to make use of them.
And the nature of these users is dependent on the level at which
evaluation takes place. Using the focus of the current study,
national evaluations of education, as the basis for further analysis, the
various categories of appropriate users will be examined. A second issue
to be considered is the interplay between those who commission evaluation
studies and other potential users.

Categories of users. While there are a number of ways to typify
user groups, for the purposes of this chapter the following category
system will be employed: (1) Congress, (2) Education Department
management, (3) program management, and (4) SEA and LEA management.

Although it may be fashionable to demean the use of evaluation research
data by Congress, it is nonetheless clear that Congress is in fact a
major user of evaluation research. Perhaps undue expectations of use
prevail, and the failure to consider a particular written evaluation report
in toto as part of the re-funding decision might easily lead to the
conclusion that evaluations are not used. Nevertheless, there is little
question that the information from evaluation research filters into the
system and is used over a period of time. A study consisting of twenty-six
interviews with congressional staff members (Florio, 1980) provided
evidence that legislative aides identified student achievement scores
as the second most important type of information (next to the cost of
the program). This evidence, along with the data on use presented in the
U.S. Office of Education Annual Report, suggests strongly that Congress
does, in fact, use the results of evaluation research. A limited sampling
of the ways evaluation use takes place includes: as input for making
funding decisions with respect to the program; as input for changing the
scope of the program; in attitude formation about the program for
potential future use.

Examples: Changes in Title VII eligibility requirements
were made in the Education Amendments of 1978 following from
the OED "Evaluation of ESEA Title VII bilingual education
program," 1974-1978; changes in Title I legislation in 1978
were made based in part on a variety of OED studies and in
particular on a congressionally requested and focused study
which was directed by Paul Hill of RAND Corporation.

A second category of user, the upper level management of the new
Education Department, might appropriately employ evaluation information in
a number of ways within the organizational structure. For example,
evaluation research information might contribute to judgments about the
quality of a program, the leadership being provided by program management,
the scope of the program, or the appropriateness of the audience, to name
but a few. It may also provide input for decisions about modifying
required rules and regulations to obtain better conformity to the
intended program strategies.

Examples: A 1976 evaluation of Title IV of the Civil Rights
Act was used as the basis for developing a set of twelve
detailed recommendations for policy changes. These were
reviewed by the Commissioner of Education, and final
regulations incorporating the recommendations were issued in
July 1978; a 1978 evaluation of the impact of ESEA Title
VII had a major influence on decisions at the Department or
Office level. The OE Commissioner noted during the 1979
Senate Appropriations Hearings: ". . . Senator, that study
was very significantly related to some major moves we made.
The first move we made was to change the director of that
office. The second move we made was to call an internal
audit that dealt with the staffing and program procedures
in that office . . . . More than that, Secretary Califano
established an internal tracking procedure in which we are to
report quarterly . . . [on] bilingual education to demonstrate
to him at least four times a year that our goals of increasing
the number of deficient children in bilingual education
was being met."

A large cadre of program managers* within a government department are
clearly the recipients and users of evaluation research data. Those who
immediately come to mind manage the bulk of the programs charged with
administering funds under a variety of program categories to state
education agencies, local education agencies, universities, etc. Evaluation
research conducted for these people may provide an input for decisions
about needed program review techniques or additional program monitoring
which is required.

Examples: The 1979 report on "OED Uses of Evaluation
Activities" provides several illustrations of program management
use of evaluation data: Evaluation of the National Diffusion
Network and related studies on school improvement have
influenced strategies to place more emphasis on the quality of
implementation, fidelity of adoptions, and the impact of
the programs on learners in the adopter sites. A second
example: Studies have been completed for Upward Bound,
Talent Search, and Special Services program for Disadvantaged
Students. As a result, evaluation findings have been
used in the writing and/or revision of regulations for the
UB, TS, and SSDS programs so as to improve award procedures,
overall program management, and monitoring and reporting
procedures.

Another audience for nationally conducted evaluation research is
the management personnel at state education agencies (SEA) and local
education agencies (LEA).* Results of national evaluations are provided
to SEA and LEA officials, and it is anticipated that there will be SEA
and LEA users. Evaluation research data from nationally conducted
evaluations may provide input for rating state or local education agencies
against other states or districts. Or, the evaluation research data
might be used for making judgments with respect to the delivery system
or instructional treatments used within the education agency.



Example: Title I is generally provided as the prime example
of SEA and LEA evaluation use. Employing a required set of
evaluation procedures enables states and school districts to
make more informed comparisons of program outcomes.** Another
example of SEA evaluation use is provided by the evaluation
of state plans for career education. Based on the evaluation
data, OED reports that "Well over half of them [states]
voluntarily provided revisions and/or additions in order to
remedy weaknesses which had been pointed out to them."



*In this paper we have only considered "nationally conducted" evaluations
and not "mandated" evaluations, such as the Title I evaluation
conducted in the local school district. This orientation is in keeping
with the Congressional Mandate for this study.

**It is not clear, however, whether mandated employment of the Title I
TIERS evaluation reporting system is, in fact, an instance of "evaluation
use." Must one first document that these procedures provide outcome data
which districts use for decision making?



Information needs and the likelihood that evaluations will be used
are clearly related to a particular user's organizational role. Thus,
pre-specification of the anticipated evaluation user is imperative. In
this section, the various possible user categories have been indicated.
But the question remains: for a particular evaluation, who are the
individuals/audiences who are most likely to use the evaluation?

Evaluation report commissioners and secondary users. It has been
said that "he who pays the piper, calls the tune." And House (1972) has
written: "who sponsors and pays for the evaluation makes a critical
difference in the evaluation findings" (p. 409). The argument may be
extended to evaluation users: the source of the evaluation is an
important factor in considering the potential user audiences and the extent
of utilization.

The crucial distinction, however, is not those who fund the
evaluation per se, or who pay the bill, but rather those who hire the
evaluator, set the agenda, direct his actions, exercise control and
oversight on the evaluation, etc. It is convenient to refer to these
persons as "evaluation report commissioners" (ERCs). ERCs
set the context of the evaluation in terms of what is considered
acceptable content, what questions are to be answered, and even which
elements of the agency are to be subject to scrutiny. If the ERCs of a
federal program are themselves program managers, then the evaluation will
probably focus on procedural aspects, bases for program modification and
improvement, and deficiencies of SEAs and LEAs in implementing the
federally funded program. Likewise, evaluators are likely to be less than
candid in their report about major program characteristics, the quality of
the program as a whole, or the quality of the program leadership. Partly,
evaluators are, in fact, sensitive to "who is calling the tune" (with
political overtones of future employment from the agency, etc.). Also,
aside from the political aspects, there is simply the question of how the
research agenda is limited: what are the allowable spheres, etc.

*In this paper we have only addressed organizationally related user groups.
Other user groups include "the public" or special interest groups, for
example.


To further illustrate the extent to which the evaluation report
commissioner is defined by "who controls the evaluation," and not simply
by such matters as who formally lets the contract or pays the bills,
consider the following example. The Title I Sustaining Effects Study,
while formally contracted by NIE and housed as a contract within NIE,
was nonetheless mandated by Congress. Moreover, the legislation allowed
for frequent interim consultation with a congressional committee to discuss
the progress and course of the evaluation, and provided for direct
reporting of the evaluation results to Congress without approval by NIE
program heads. Even though the formal contracting occurred within NIE,
it is certainly appropriate to conceive of Congress (or at least the
congressional committee involved) as the evaluation report commissioner. As
a side-light, this particular evaluation study is perhaps most often cited
as among those most useful to Congress.

Therefore, it is important to consider how national evaluations are
commissioned in order to understand potential evaluation use. Within the
Office of Education as previously constituted, evaluation activities were
commissioned in at least three different locations. Some evaluations were
initiated, supervised, etc. in the Office of the Assistant Secretary for
Planning and Evaluation; the large bulk of the evaluations were
commissioned through the Office of Evaluation and Dissemination; and some
evaluation report commissioners were to be found among the various programs
of the Office of Education. Within the National Institute of Education,
evaluation reports were commissioned, by and large, at the program level.

Sometimes evaluations are intended by commissioners principally to
have impact on secondary users. Department leadership might commission a
report to provide needed evidence about a program. When it is known, for
example, that congressional hearings will shortly be forthcoming on a
particular program up for renewal, the commissioned evaluation would very
likely aim primarily to provide information (hopefully positive) to
congressional users. Moreover, secondary users might be found within the
same department, as for example, when an evaluation is commissioned within
a program already congressionally approved in order to provide data to
department managers that might lead to modifications in the program. Many
other examples could be provided of secondary user relationships.



However, not all evaluations have anticipated or possible secondary
users. Sometimes the organizational role of the evaluation report
commissioner and the spheres of inquiry in the evaluation may limit its
utility primarily to ERCs. The development of an evaluation around
questions primarily of interest to evaluation report commissioners who had
programs in the Office of Education may make the evaluation too limited in
scope for Congress as a potential secondary user. Or, by the same token,
program manager evaluation questions may be too aggregated and national in
focus to be of use to SEAs and LEAs. Furthermore, different user categories
may, by the nature of the information they seek, impose quite different
standards on appropriateness of evaluation methodology (e.g., a program
manager may be quite satisfied with descriptive data as a source of
information for program change while Congress or Department leadership are
unwilling to accept as convincing data which do not come from an
experimental design). Indeed, these qualifications should not be construed
as criticisms, for it may be impossible to develop evaluations that fully
meet the information needs and acceptable standards of evaluation data
of a variety of users.

The point is clear: the information required for users at one level
may in fact preclude important data for other users. The Title I Evaluation
and Reporting System (TIERS) provides an excellent example. In the attempt
to develop an evaluation system to satisfy the information needs of all
levels of users (from the classroom teacher to Congress), it may be that
a system has been created that is not totally appropriate to any users.
And, beyond the reporting system, it becomes even more obvious that the
development of an evaluation report from TIERS data, of necessity, focuses
on one level of aggregation which diminishes the report's value to users
at other levels.

To sum up, in considering potential users for evaluations, it is
important to examine who the evaluation report commissioners are, and
the extent to which the evaluation has anticipated secondary users. The
various management levels within the Office of Education as well as
Congress, SEAs and LEAs offer a wide variety of evaluation report
commissioners and anticipated secondary users.



Uses: What Constitutes Utilization?

There is no unified view of whether evaluations have impact or what
constitutes evaluation use. One belief, well documented in the literature,
contends that evaluations seldom influence program decision making, and
countless articles reflecting this stance bemoan the unlikelihood that
evaluation will ever break through the barriers and have real impact on
programs. An alternative point of view, only recently expressed in the
literature, reaches quite a different conclusion: that evaluations do
already influence programs in important and useful ways.

The extent to which an individual propounds one or the other of these
viewpoints is largely dependent on the definition of utilization that he/she
employs. The group which decries the lack of use would undoubtedly employ
a very restrictive definition which would require that a single intended
user (typically the ERC) make a specific decision immediately following
the receipt of an evaluation report and heavily (if not solely) based upon
the findings of that report. Alternatively, it would be easy to find
great evidence of utilization with a definition that encompasses any use
of anything from the evaluation for purposes broadly conceived.

In our view, neither of these approaches to defining evaluation util-
ization is appropriate. Instead, they represent caricatures of definitions
representing opposite ends of a continuum of views on utilization. Neither
is workable; neither is realistic. In the remainder of this section, we
will examine a variety of views on evaluation use from the literature and
attempt to derive a more appropriate definition.

Other researchers have added substantively to the deliberations about
a definition of evaluation utilization. Caplan, Morrison, and Stambaugh
(1975) have said: "utilization of knowledge . . . occurred when a
respondent was familiar with relevant research and gave serious consider-
ation to an attempt to apply that knowledge to some policy relevant issue"
(p. VII). Furthermore, these authors have contributed the concepts of
"instrumental" and "conceptual" utilization to the literature of the field.
These concepts are further elaborated by references to instrumental use
as instances where respondents cited and were able to document the specific
way in which information was being used for decision making purposes. On the
other hand, conceptual use refers to those instances in which a policy
maker's thinking was influenced by the evaluation or a policy maker planned
to use information in the future.

There is general acceptance for including both instrumental and
conceptual dimensions within a general definition of evaluation use. In
an examination of research on evaluation utilization, Conner (1980) noted
that five of the six major studies employed a definition of utilization
which encompasses both instrumental and conceptual usage. (He concluded
that with this broadened definition, usage generally was found to be
high.) Knorr (1977) further extended this use category system by intro-
ducing the notion of "symbolic" use. Pelz (1978) draws the distinction
between three different types of symbolic use: use as a substitute for
a decision; use to legitimate a policy; and use to support a predeter-
mined position. The first of these types of symbolic use does not appear
to be an actual use of the evaluation and, moreover, has been discussed
earlier in this chapter (as "window-dressing" and "public rela-
tions" evaluations). The latter two symbolic use types appear to involve
a common theme--substantiating a previously made decision or current point
of view.

Alkin, Daillak, and White (1979), in their recent study, have attempted
to isolate the essential components of utilization and present their defi-
nition of utilization in the form of a Guttman facet design sentence. The
facets include: (1) the nature of the client (e.g., evaluation report
commissioner); (2) the nature of the use (e.g., one of multiple influences);
(3) the type of use (e.g., making a decision); and (4) the topic of use
(e.g., continuance of a program component). The notions of identifying
both primary and secondary users and of instrumental and conceptual uses
appear to be encompassed within this definition.

More recently, Weiss (1980) has provided additional elaboration on a
theme prevalent in the literature by discussing the extent to which
research or evaluation information is utilized within systems primarily
on very gradual bases, or over long periods of time. The concept of
"knowledge creep"--of incremental, temporally gradual use of evaluation
information--has also been discussed, to some extent, by Caplan et al.
(1975), Patton et al. (1975), Alkin et al. (1974), and Alkin et al. (1979).



Drawing from these sources--the prevalent literature in the field--
a definition of utilization is presented for purposes of this report.
The definition is in the form of a simple matrix depicting instances of
evaluation use. (See Table 2.1.) As seen in the table, the various
categories of evaluation use are fairly obvious. Evaluation information
may be used: to substantiate a prior decision, as input to a current
decision, or as part of general attitude formation.

The first and third of these categories are fairly self-descriptive;
the second requires additional clarification. Three subsets have been
described for the second category--"input to a current decision." First,
evaluation information may be the primary basis for making a decision.
It probably is quite naive to expect that policy decisions will be made
based solely on evaluations; however, there are instances in which evalu-
ation provides the primary information basis for policy action. Or,
evaluations become mingled with other data input--personal views of decision
makers, judgments of political difficulty, etc.--to determine a policy
decision. A third subcategory is perhaps difficult to distinguish from
the second (and indeed, may be the same)--evaluation as one thread in
the fabric of cumulative inputs over time. In this instance, perhaps
there have been prior evaluation reports in prior years. Possibly, the
evaluation information received in the current year is just that piece
of additional information which stimulates some incremental policy changes.
This is best described by one of the participants in a conference on
evaluation use in federal agencies:

    The way in which evaluation contributes is through the
    water torture approach. Each study adds a little infor-
    mation, only a little bit, and after a good many studies you
    begin to feel it a little more and a little more . . . .
    So in terms of the impact of evaluation on broad program
    direction and policies it has that kind of cumulative effect,
    and those who ask which study led to the termination of a par-
    ticular program, just don't understand either decision making
    or evaluation. (Chelimsky, 1976)

It may well be that this subcategory is most pervasive and includes most
of the instances of documented evaluation use for policy decision making.

Table 2.1
EVALUATION USE MATRIX

Substantiate a previous decision or point of view
Primary basis for a decision
One of multiple current inputs for a decision
One of multiple cumulative (temporal) inputs for a decision



Instead of attempting to provide examples for each of the cells of
the evaluation use matrix, several examples should provide sufficient
elaboration to make the definition workable:

Example--Congress, as an Evaluation Report Commissioner,
uses Title I sustaining effects data along with the
results of Title I evaluations in previous years, the
views of constituents, and other testimony to refund
and make changes in the Title I program.

Example--The findings of OE studies to identify effec-
tive projects in compensatory education were the pri-
mary basis for the decision as to which of these
projects is to be included within the National Diffu-
sion Network.

Example--An evaluation of Title VII provided data which
was the primary input for the decision by Congress as
an influenced secondary user to change the Title VII
eligibility requirements.

Summary

In this paper various categories of users have been described along
with the distinction between evaluation report commissioners and secondary
users. Furthermore, a matrix has been presented explaining conditions
constituting evaluation use. A conceptualization of evaluation users and
evaluation use such as this raises a host of procedural and politically
related issues. It will be important to recognize the complicating fac-
tors in the federal system which inhibit utilization and vary from this
conceptual schema. As already noted, the choice of user audience carries
with it implications for the way in which the evaluation is to be conducted.
Furthermore, the choice of user and appropriate use has implications for
the organizational structure of evaluation services within a department.
For instance, some organizations (e.g., centralized) are amenable to
satisfying one kind of user need (e.g., department management), but are
not at all conducive to others (e.g., program managers).

CHAPTER 3

WHO CONTROLS EVALUATION?
THE INTERORGANIZATIONAL COMPLEXITIES OF EVALUATION RESEARCH

Robert K. Yin
Massachusetts Institute of Technology

A. HOW EVALUATION RESEARCH IS REALLY ORGANIZED

Evaluation research teams have perpetrated a myth about themselves:
that researchers alone control the quality, usefulness, and relevance
of an evaluation study. The myth is reflected in the common remedies given
for improving evaluation research. We are told that, if only the
research was designed or conducted more carefully, the study might have
been better (e.g., Berryman and Glennan, 1978). More technically, this
advice is often translated into modified research designs, the search for
better measures of educational performance, and the recruitment of more
qualified and experienced research personnel.

This myth has been amplified by most evaluation textbooks as well as
by the implicit norms of policymakers. Among typical evaluation texts
(e.g., Rossi et al., 1979), the scope of coverage includes concerns
about the research: its technical design and the ways of reducing threats
to reliable and valid findings. Very little is said, in most texts, about
the degree to which the research team may or may not control these facets
of the research; the issue is rarely even addressed. Similarly, among
the implicit norms of policymakers, the ways of improving the quality and
utilization of evaluation studies are assumed to be matters of research
techniques. Thus, for instance, the U.S. General Accounting Office (GAO)
is charged with identifying improved evaluation methods--the assumption
being that such methods need only be implemented by researchers in order
for the state-of-the-art to improve (e.g., U.S. General Accounting Office,
1978).

In fact, the outcomes of evaluation research are not completely
controlled by the research team. Instead, every evaluation study must
be regarded as a complex, interorganizational affair, involving at least
three parties:

• A research team, usually located in a university or an
independent research organization;

• The practitioners operating the action program being
evaluated, usually located in federal, state, or local
levels of government; and

• The officials sponsoring or funding the evaluation study,
often synonymous with the officials funding the action
program, and usually located in a federal agency.

For the purposes of further discussion, these three parties will be
considered the research team, the action agency, and the sponsoring
agency. The purpose of the following paper is to show how all three
parties can be said to share the control over an evaluation study, and
thus how any improvements in the quality or utilization of evaluation
research will require coordinated efforts--and not just actions by the
research team.

A Contrast: Traditional Academic Research

Before discussing the complexity of evaluation research, the
present paper should clarify the origins of the myth. These are
embedded in the traditional organization of academic research, in which
a research team does indeed work independently of the other two parties.

In traditional academic research, the research team generally
decides what to study and how the research should be conducted. In
some cases, the research may involve an action site--e.g., a classroom,
a school administration, or a governmental program--from which data
will be collected. However, these action sites are selected by the
researchers on the basis of the intended research design, and the
participation of the action sites naturally depends on their willingness
to cooperate. Often, a research team may be refused access to a par-
ticular site. But when access is granted, it is on the basis of a
mutual and voluntary agreement between the research team and the action
agency. For this reason, the traditional model does rightfully focus
on the primacy of the research team's role--there may be no action
site, or when one exists, the action site's participation is decided
on an individual basis and is not usually part of any broader pro-
grammatic context.

Similarly, the role of the sponsoring agency in traditional aca-
demic research is minimal. The sponsoring agency, often making a grant
award to the research team, takes no greater interest in the research
beyond some measure of administrative accountability and research
success--usually taking the form of nominal progress reports followed
by formal academic publications. In this part of the relationship, the
research team may actually know very little about the sponsoring agency's
bureaucratic environment and procedures; knowledge of these issues is
further buffered by the university within which the research team operates.

In the traditional model of doing research, then, the research
team does mainly control the research. The design of the research is
created and proposed by the researchers, the conduct of the study is
fully under their control, and any problems with the quality or useful-
ness of the research can be correctly attributed to the skills of the
research investigators. For this reason, textbooks aimed at improving
the research design of various types of studies, or at developing better
instruments and measures, are appropriate ways of improving the research.

The Inappropriateness of the Traditional Model to Evaluation Research

This very situation, in which the research team is the prime and
generally only actor in the conduct of academic research, is inapplicable
to evaluation research. This conclusion is based on four observations:

First, the research team must work with a specific set of action
agencies. The designated action agencies are, of course, those involved
in the program being evaluated. However, their participation in the
research may not be voluntary, and whether they feel threatened by the
research team or not, considerable efforts must be made--during the
conduct of the evaluation study--to develop a workable relationship
between the research team and the action agency. More often than not,
this workable relationship is based on a set of quid pro quos, of which
the following are examples:

• In return for access to agency documents, the research
team may have to collect certain data unrelated
to the evaluation study but needed by the action agency;

• In return for using the action agency's facilities, the
research team may have to use its computational facilities
to produce information for the action agency;

• In return for the action agency's participation in the
study and review of the results, the research team may
have to assist the action agency in preparing one of
its proposals for federal funds; and

• In return for using the time of the action agency's
staff, the research team may have to provide technical assis-
tance, of an informal nature, to the action agency.

As any evaluation researcher knows, this list can be quite long. More
important, the success of the research has become increasingly dependent
upon the workability of this relationship.

Second, the sponsoring agency often plays a major role in setting
the conditions for doing the research. There are situations where
research teams do initiate their own evaluation studies (e.g., see the
studies reviewed by Bernstein and Freeman, 1975). However, in most
large-scale evaluations in education, the studies are "procured" by
the sponsoring agency (Sharp, 1980). This means that the sponsoring
agency sets the major boundaries for the research, including:

• The overall level of effort to be expended in the research
(note that in traditional research, this level
is determined by the research team in its original proposal);

• The scope of work;

• The types of issues, research design, and measures
that are to be used; and

• The timing of various phases of the research and
deadlines that are to be met.

The research team, of course, is not a completely passive actor in
determining these conditions. But the increasing explicitness of the
requests for proposals (RFPs) that are currently issued by sponsoring
agencies means that the staff of the sponsoring agencies have been
increasingly designing the "technical" aspects of the research to be
done.

Third, the sponsoring agency and the action agencies often impose
limits on the research through the design of the action program. One
common occurrence is for the action sites to be selected on grounds
independent of research considerations--e.g., political and adminis-
trative criteria. For instance, in federal programs, a regional dis-
tribution of action sites is often the result of a political choice;
but this choice constrains the nature of the ultimate research design.
Other decisions about the implementation of the action programs--e.g.,
the staggered timing for initiating work at the action sites--also
affect the evaluation study; in this case, the research team may be
unable to gather uniform "baseline" data or to conduct the research in
as efficient a manner as possible. These and other characteristics of
the action program, then, may all have an implicit effect on the
"technical" aspects of an evaluation study, but the conditions are set
by the sponsoring and action agencies, and not the research team.

Fourth, there has been an increasing fragmentation of responsibilities
within the sponsoring agency. At least three parties, all within the
sponsoring agency, may have some influence over the design and conduct
of the research. These parties include:

• The official project monitor for the evaluation study itself;

• The program monitor responsible for implementing the
action program; and

• The contracts office within the sponsoring agency, often
dealing simultaneously with all of the other parties
within and outside of the sponsoring agency.

A research team must learn to deal with all of these parties. Sometimes,
compromises must be reached because the evaluation project monitor and the
action program monitor are both the audiences for the evaluation study.
Other times, the contracts office can create difficulties by requiring
the approval of specific activities within the action program or within
the evaluation study, but then delaying action to such an extent that
the research team must make further modifications in its original
research plan.

In fact, this final observation regarding the fragmentation within
the sponsoring agency suggests the full organizational complexity of
conducting evaluation research: Three types of agencies and five
relevant parties must all collaborate in order for the research to
be done. These relationships are shown in Figure 3.1.

Figure 3.1

General Observations

The interorganizational relationships just described, and the
complexities shown in Figure 3.1, all show how evaluation research can
no longer be considered along the same lines as traditional academic
research. Moreover, Figure 3.1 has merely indicated the major roles
that are involved in the organization of an evaluation study. In
numerous specific instances, the array of actors can become even more
diverse. For example, in some evaluation studies, Congressional staff
or members may be a key part of the audience for the evaluation results.
As another example, a full description of the action program might have
to show the presence of community groups, often given an official role
in monitoring the action program and the evaluation study, or the presence
of a technical assistance contractor, whose responsibility is to help
the action agencies (there may even, under certain circumstances, be a
technical assistance contractor to help the sponsoring agency). Clearly,
the list grows.

What all of this means is that our concerns with the outcomes of
evaluation research--i.e., the quality, utilization, and relevance of
the research--are not solely controlled by the research team. Textbooks
and policymakers that ignore the complex interorganizational relation-
ships are therefore inaccurate in suggesting that improvements in
evaluation research can be made by the research team alone. Concerns
such as "if only a better research design had been used . . ." must be
considered along with other, equally relevant concerns, such as "if
only the contracts office in the sponsoring agency had approved the
action agency's budget in time . . ."

To illustrate the effects of this interorganizational complexity in
terms of the quality, utilization, and relevance of evaluation research,
the next section of this paper describes a few contemporary conditions
within which evaluation research must be conducted.



B. ILLUSTRATIVE COMPLICATIONS



The interorganizational complexity that underlies evaluation
research can affect evaluation studies at every major stage of their
development: problem definition and evaluation design, conduct, and
dissemination. The illustrative situations described below have been
encountered by any number of evaluation studies in education,* and are
thus described in general terms only. Each situation represents a
complication imposed by the fact that several parties, and not just the
research team alone, have to be involved in making the critical decisions.

Designing an Evaluation Study 

One prominent complication occurs at the design stage. First, the
sponsoring agency and action agencies may limit the range of relevant
evaluation designs through the design and implementation of the action
program. Second, and more important, the basic evaluation design is
described, often in great detail, in the Request for Proposals (RFP)
that is issued by the sponsoring agency as the initial step in sup-
porting an evaluation study.

Contemporary RFPs often specify the sites to be studied, the data
elements to be analyzed, and the time intervals for different data
collection steps. In short, the RFPs can dictate the entire scope of
the evaluation design. In response, proposing research teams may
attempt some modifications. However, the major modifications that will
occur, if any, are likely to occur after an evaluation award has been
made--when the sponsoring agency and research team are more openly able
to agree about any shortcomings in the original RFP.



* Many specific illustrations are found in the full text of the
Committee report.



The final research design of the evaluation study, then, is a
function of: (a) the nature of the action program as it has been
implemented, and (b) a negotiated settlement between the sponsoring
agency and the research team. What this means is that if one is
interested in increasing this aspect of the quality of evaluation
research, far more issues must be addressed than the mere stipulation of
methodological choices. Guidance is needed concerning such steps as:

• Conditions in the design of the action program that might
negate the ability to do evaluations of minimal quality;*

• The process whereby an RFP is written and reviewed, and
the staff persons who are involved in these activities; and

• The ground rules for any negotiations in creating the
final design.

Of particular interest, among these steps, might be further inquiry into
the training and background of the sponsoring agency's staff that is
responsible for issuing RFPs. Often, one suspects that these indi-
viduals, who may have great influence over the design of an evaluation
study, are either inadequately trained in evaluation or inexperienced
in conducting evaluations. Often, such staff persons believe that a
"textbook" version of an evaluation can be done, and they fail to
recognize the actual political or administrative realities in doing
the evaluation.

Getting the Evaluation Done

Similarly, the conduct of an evaluation is more complicated than
the collection and analysis of data by the research team. For instance,
several parties may have to review the data collection instruments pro-
posed by the research team. The review can range from full, formal approval
by the FEDAC forms clearance process to a less formal review and approval
by the sponsoring agency. In a few cases, the action agencies may also
have a part in reviewing the instruments--an activity usually conducted
by a "user panel" that has been established to advise the research team.

* The author knows of no list, for instance, of types of action programs
that cannot be evaluated. Yet, some do exist and ought to be recognized
as such.

Under these conditions, the process of instrument development again
becomes a negotiated process. The final product, both in scope and
depth, can represent a compromise among competing priorities and cannot
necessarily be regarded as the best state-of-the-art from a research
point of view. Again, to improve the quality of future evaluations,
guidance is not just needed on the techniques of instrument design.
Systematic information is also needed, for example, on how the FEDAC
process can be conducted more smoothly and in a more timely fashion--a
responsibility that often involves key staff persons in the sponsoring
agency. Similarly, guidance may be needed on the fair limits of non-
research priorities--e.g., how far the research team should go to incor-
porate questions of interest to one of the other parties but not
critical to the evaluation.

Reporting the Evaluation Results 

The textbooks tell us that, for utilization purposes, the production
of evaluation results should occur in a timely fashion. Usually, this
means that the results should be made available when key policy decisions
are being considered.

Not surprisingly, the research team does not have full control
over this stage of the evaluation, either. Although the research team
may have tried to keep to a policy-relevant schedule, the final results
must also be reviewed by the sponsoring agency (and sometimes by the
action agency) before a report can be made public. Delays can certainly
have occurred in the conduct of the research. But the final reporting
of results can also be delayed by the action of these other two agencies.
Sponsoring agencies may be especially susceptible to cumbersome and
lengthy review processes. For instance, the draft report may be shown
to a wide variety of individuals within the sponsoring agency, all of
whom may have a different point of view about the evaluation or the
program being evaluated. Under such conditions, the research team often
has difficulty in merging the various comments into a coherent pattern,
adding to the time and effort needed to revise the final report.

Independent of the issue of whether sponsoring agencies may pur-
posely, for political reasons, delay the issuance of final reports,
there are few guidelines for a minimally acceptable review process.
One of the notable gaps in most RFPs and proposals is that the sponsoring
agency's review process for draft manuscripts is in fact not described
at all (usually, the RFP merely stipulates that the review will occur
within a particular time period). Are some review processes more desi-
rable than others? Can the review process be streamlined so that the
evaluation results can really be reported in a timely fashion? These
are some of the issues that still need to be addressed and that again
go beyond either the technical aspects of doing evaluations or the
full control of the research team.

Summary 

These few illustrations should be sufficient to indicate the
degree to which evaluation research is a joint, interorganizational
enterprise. As such, any attempts at improvement must not only be
focused on the research team and its technical methodologies, but also
on the capabilities of other relevant parties, including the staffs of
the sponsoring agency and action agency.

For example, rarely has one heard any debate regarding the ways in
which RFPs should be written or even by whom they should be designed.
(For a modest beginning, see Weidman, 1977.) Yet, from the standpoint
of improving evaluation research in education, changes in the RFP pro-
cess may be more important than any potential changes in the capabilities
of the research team. Perhaps it is even time for a textbook on how to
write RFPs and how to monitor research, or even on the development
of minimum standards regarding those staff positions in the sponsoring
agencies--to appreciate better the role that such staff can have in
affecting the quality, relevance, and utilization of evaluation research.




Similarly, this may now be a good time for further reviews of all
the barriers confronting the conduct of evaluation research, including
the role of FEDAC clearance. A few years ago, one research investigator
did attempt to catalog all of the "regulations" that affect research
(Gandara, 1978); the review revealed a mine field of potential barriers
and problems. If evaluation research has become over-regulated, changes
must be contemplated in the regulatory environment and not just in the
technical aspects of research methodologies.

Finally, it is now clear that "successful" evaluation researchers
are those who are able to manipulate the interorganizational complexities
that have been identified within this paper. For instance, Paul Hill,
who conducted a successful evaluation of the compensatory education
program, wrote that the five ingredients for success included the
following conditions (Hill. 197E(): 

1. The evaluation was aimed at decisions that "users" could
make.

2. The evaluation was conducted in open consultation with
potential users.

3. The evaluation recognized that research information was
only one source of information that would be available
to users.

4. The evaluation results allowed for divergent positions
and values.

5. The results of the evaluation were produced in a timely
manner, to feed debate about the action program.

What is surprising about this list of major lessons from a successful
evaluation researcher is that not one of the lessons involved the
technology of the evaluation. Each of the lessons, on the contrary,
covers some aspect of interorganizational relationships, showing how the
research team must be prepared to manage such relationships.

Myths die hard. The purpose of this paper will have been served if
we no longer think of evaluation research as being organized like traditional
research. Evaluation research is not done solely by a research team
and therefore cannot be controlled by the research team alone. The
staffs of other organizations, as well as the nature of interorganizational
relationships, all become important to the design and conduct of an
evaluation study. Recognition of this complexity should lead to better
insights on how to improve evaluations.
CHAPTER 4



EVIDENTIAL PROBLEMS IN SOCIAL PROGRAM EVALUATION AND THEIR 
ANALOGS FROM ENGINEERING AND THE PHYSICAL SCIENCES, MEDICINE, 
AND AN ASSORTMENT OF OTHER DISCIPLINES, BASIC AND APPLIED, 
TOGETHER WITH SELECTED HISTORICAL REFERENCES

Robert F. Boruch 
Northwestern University, Evanston, Illinois 


1. Introduction 

This paper concerns the problems that are engendered by efforts to
collect evidence about a problem or a proposed solution. The special
focus is on problems common to both social program evaluation and to
evaluation in other arenas, notably the physical sciences, engineering,
medicine, and occasionally commerce. Not all of the problems are new,
despite contemporary arguments over what to do about them. And so the
paper is studded with references to early pertinent work.

I have several reasons for developing a disquisition of this sort.
In the first place, science in the abstract recognizes that the difficulties
of accumulating decent information transcend discipline. But
the lay public and its representatives and analysts in sundry disciplinary
camps often do not. The failure to apprehend that the same problems
occur in both physical and social sciences, indeed that many are very
durable, is a bit shameful. It results in social research's being construed
as more feeble than work in other vineyards. It is more feeble
in some respects. It is at least as robust in others, though. And crude,
comparative work of this sort may help to illustrate the point.

The more immediate, less rhetorical, feature of the motive concerns
the responsibility of agencies, such as the U.S. General Accounting Office,
to oversee performance of government in a variety of tasks. The problems
that these agencies encounter are often statistical and scientific at
their core, though they are infrequently labelled as such, and common to
several disciplines. This paper may help to remedy this problem as well.



A second motive is more personal. The writer is a metallurgical
engineer turned social scientist. The vernacular differences I encountered
in stumbling from one field to the other are tedious at best. The social
program evaluator's "formative evaluation" is no different, though perhaps
more obscure for good or ill, from the engineer's "trouble-shooting" or
"development." At worst, they often imply notional differences, between
qualitative and quantitative, subjective and objective, that are often
gratuitous, even obstructive. In this respect, the spirit of the paper
is akin to others, notably Florman's Existential Pleasures of Engineering.
It is more a catalog than intellectual history or dialectic between camps.
But if it succeeds in stimulating better understanding of the nature of
such problems, one of its objectives will have been met.

The examples illustrate failures of scientific knowing or of common
sense, little cortical collapses. They are not intended to demean research
in the physical sciences, medicine, or business. The point is that the
problems are persistent, and we ought to appreciate their appearance in
a variety of human enterprise.

2. Implementing Programs and Characterizing Delivery 
It is something of a truism that social programs are never delivered
as advertised. The social scientist often finds it very difficult to assure
that the program under investigation has the form that it is supposed to
have. Moreover, it is often difficult to monitor the discrepancy between
plan and its actualization systematically. The problem is persistent in
evaluating complex, broad aim efforts, such as Model Cities Programs
during the 1960's (Pressman and Wildavsky, 1973). It is characteristic
of newer evaluations, including those directed at programs which are said
to be well structured but whose structure depends heavily on individuals'
following marching orders. They often do not or cannot. See, for instance,
Fairweather and Tornatzky (1977) on evaluations in mental health, Kelling
(1976) on police research, Sechrest and Redner (1978) on estimating the
effects of innovative criminal rehabilitation programs, and Rossi (1980)
on education, welfare, and other programs.

Laboratory research is not spared the problem, of course, though its
severity and measurability differ from the field variety. Grey eminence
L. L. Thurstone encountered military trainers who gave instruction secretly to
control group telegraphers in the interest of assuring they got the benefit
of sleep learning (Mosteller, 1978). Partly on account of such early
difficulties, it is common practice in social psychological research to check
treatment manipulations. The measurement and reporting problems apply to
methodological research on improving cooperation in mail surveys (see
Sood, 1978, on Christopher Scott), to educational evaluation (Leonard and
Lowry, 1979), and to applied social research on other topics (Boruch and
Gomez, 1979).

The problem is not confined to evaluation of social programs. It
appears in the engineering sciences where, for example, allegations about
the reliability of control over variables affecting reactor cooling systems
have been a grave concern (Primack and von Hippel, 1974). The control
problem in some chemical processes has been sufficient to warrant V. V.
Federov's developing new approaches to understanding in randomized tests
at Moscow. (Despite this there appears to be little attention to the
problem in texts of experimental design in industry.) Bureaucracies have
simply forgotten to implement plans for pesticide control (U.S. General
Accounting Office, 1968) and to deliver vasectomy kits in fertility control
programs (Sullivan, 1976). They have denied the existence of treatments
or mislabelled them: recall the U.S. Defense Department's denial of the
use of poison gases at the Dugway facility. The problem is implicit in
early agricultural experimentation as well, if we judge correctly from
Yates' (1952) concerns about correction of bias in moving from laboratory
versions of fertilizer application to field studies. It is also buried
in the history of manufacture, including the production and adulteration
of foodstuffs: recall Accum's treatise for the 19th century consumer.
Lest the blame be laid on institutions, recall that the odds on being treated
by the pill advertised on the label are 9 to 1 according to the Food and
Drug Administration. What happened to the remainder is not known.

There are more than a few interesting parallels between evaluation of
social programs and meteorological studies of the past ten years, judging
from Braham (1979), Kruskal (1979), Flueck (1979), Crow et al. (1977),
Neyman (1977), and others. The commonalities are especially evident from
randomized tests of the effects of cloud seeding on precipitation. Pilots
who were responsible for seeding silver iodide crystals had their own
preferences about where to fly, notably in sight of the coastline for the
Israeli experiments. Decisions had to be made about whether to shift the
target accordingly. Spillover of seeding or contamination of neighboring
clouds is a threat to the validity of inferences in these studies, just as
it is in the social sector where children not assigned to special education
may receive it anyway from well-intentioned teachers. Seeding flares in
early experiments in Florida were imperfect, just as nutritional supplements
were in the early Colombian experiments on the supplement's effect on ability.
Measuring the level of imposition or of receipt of treatment seems to be
no less difficult here than in the social sector. Indicators of intensity
of treatment, for instance, are sometimes crude, e.g., recorded [illegible]
seeding and mean wind speeds in the target area. Reliably indexing cloud
conditions is all but impossible on account of their variability, and this
problem is analogous to the chronic one of assaying the local conditions
that may affect delivery of welfare services, educational TV, or income
transfer payments, in evaluating social programs.

The unwillingness or inability of field staff to adhere to the regimen
demanded by a new social program seems not much different from the reluctance
evident in some tests of medical innovation. For instance, attempts
to determine whether conventional, enriched-oxygen environments for treatment
of premature infants actually caused blindness met with remarkable
resistance from some nurses and physicians. The latter were unable to
countenance depriving infants of oxygen, though subsequent research demonstrated
that oxygen was indeed influential in producing blindness (Silverman,
1977). The difficulty here parallels earlier ones, encountered by
British Army Surgeon General John Pringle and others who attempted to
reform the sanitation practice of hospitals (Marks & Beatty, 1976). The
problem also extends to well-trained specialists where, for example, the
integrity of an operation such as coronary bypass is variable, judging by
indices such as perioperative heart attacks, graft patency, and crude
hospital mortality rates (Proudfit, 1978). A similar problem, in less
obvious form, emerges when one considers the material used in tests of
vaccines and drugs. Confirmatory tests of polio vaccine were disrupted
briefly by a product that induced poliomyelitis instead of preventing it
(Meier, 1972; Meier, 1975). The Indian tuberculosis prevention trials
were executed partly to determine whether the effectiveness of a vaccine,
demonstrated earlier to have been effective, had altered because of strain
mutation and changes in antigenicity, variation in production methods, or
variation in dosage levels (Tuberculosis Prevention Trial, 1979; mman, 1980).

A cruder form of the problem involves receipt of treatment and adherence
to regimen. For example, in the Kaiser-Permanente tests of multiphasic
screening, many of the individuals assigned to the screening
program failed to turn up for periodic examination. The research staff,
interested in effectiveness of screening and not of natural turnout rates,
mounted an intensive telephone program to encourage participation in the
free and presumably beneficial service (Cutler et al., 1973). Similar
encouragement strategies have been necessary to obtain interpretable estimates
of the effects of viewing educational television. A good deal of
the argument over the implications of the University Group Diabetes Program
hinges on an identical problem: only a minority of patients in at least one
group appear to have adhered faithfully to the treatment regimen to which
they were assigned (Kolata, 1979b).

3. The Odds on Success and Failure
and Uninformed Opinion

These were the generations of Budgeting

Planning-Programming-Budgeting begat Management by Objectives
Management by Objectives begat Zero Base Budgeting
Zero Base Budgeting begat Evaluation
Evaluation begat Experimentation
Experimentation showed that nothing works.

From A. Schick, Deuteronomy.
The Bureaucrat, 1976.

The concern that innovative social programs will fail is justified.
But the expression of that concern is often pessimistic, occasionally
alarmist in some camps, wildly optimistic in others. At George Washington
University, for instance, we were taken aback by the plaint that evaluation
is discouraging to the public, bureaucrats, and politicians because positive
effects appear infrequently, and so it should be trimmed. In one of
Patricia Graham's public addresses as head of the National Institute of
Education, she took pains to recognize declining agency morale and attributed
it partly to the conduct and results of contemporary program evaluations.
Pessimism is not an uncommon theme in the academic sector either.
Here, it is easy to find comfortable cynicism about the lack of good evidence,
and occasionally, the judgment that because evidence is poor in
quality, programs are also poor in quality.

One problem, of course, is to determine when the pessimism is warranted.
We believe it is generally as misleading as the incontinent optimism of early
programs. In particular, the vague negative view is not well justified,
simply because we do not yet have reliable information on the relative
frequency of failure, success, or mixed results of new projects. The short
history of evaluative policy and briefer development of competent field
testing account partly for the scarcity of data about odds. To illustrate
one approach to understanding this context, consider Gilbert, McPeek,
and Mosteller's (1977) examination of high-quality evaluations of surgical
innovation. Considering only well designed evaluations, they find that
about one-third of such innovations are fairly successful relative to standard
surgery, a third are worse than standard, and a third do not differ
appreciably in effectiveness relative to normal practice. As one might
expect, similar problems have affected the introduction of new drugs, though
the current success rate is not clear. For instance, a massive reevaluation
of the efficacy of drugs was undertaken by the National Academy of Sciences
following the 1962 Drug Amendments Act. The report suggests that about
7% of the drugs and 19% of the claims were ineffective (see Hutt's remarks,
page 228, in National Academy of Sciences, 1974). The most pessimistic
estimate includes drugs that are only "possibly effective" and drives the
statistic up to 60%. Gordon and Morse's (1975) coarser review of the
well-designed evaluations which have been reported in the sociological
literature suggests that 75% of the programs under study fail to show any
improvement over comparison programs. If one admits poorly designed evaluations
in the calculations, the odds change, of course. Some examples are
given in the section on inept design of evaluations.

No comparable efforts to assay likelihood of success have been completed
in education research and development. But a crude upper bound
might be obtained from statistics on projects that have passed muster with
the Department of Education's Joint Dissemination and Review Panel (JDRP).
The JDRP reviews evaluative evidence on projects submitted by project
managers to determine if evidence and size of the project's intended
effect are sufficient to warrant further federal support. It is a biased
sample of all such projects since submission to review is voluntary.
About 60% have been approved in recent years. At least one lower bound
estimate for one category of projects is implied by a recent American
Institutes for Research review of bilingual programs. Only 8 out of 175,
or about 5%, were judged to have sufficient evidentiary support to warrant
approval (Boruch & Cordray, 1980).

Judgments about failure rate in the social realm are often based on
what appears to be the absence of failure or mixed results in others.
So, for example, the critic may point to innovations in engineering as a
remarkable standard against which social innovations do not fare well.
That standard is misleading in several respects, not the least being general
ignorance of failure rate. Ordinary bridges, for instance, do collapse.
It was not until 1636 that the first quantitative treatment of stress in
bridge structure appeared, written by Galileo. "Before his time the
strengths and deformations of structures were determined primarily by trial
and error. A structure was built. If it stood up, well and good. If not,
then the next structure was made stronger where the first one failed, and
so on" (Borg, 1962, p. 4). Bridges failed at a rate of 25 per year following
the Civil War. In the 1900's, bridges large enough to symbolize a new
industrial age collapsed before completion because "large steel members
under compression behaved differently than the smaller members that had
been tested time and time again" (Florman, 1976, p. 32). Suspension
bridges have stumbled since 1741 despite their stately grace. The failures
recorded in 18th century Scotland continued in 19th century England and in
the 20th century United States. The Tacoma Narrows Bridge, which failed in
1940 on account of progressively amplified wind vibrations, is a common
illustration in introductory physics texts. The spirit of that illustration
also underlies examples of flaws in the evaluation of Head Start, cited
in graduate texts on design of evaluations. The rules for making bridges
robust against amplified vibration did not become clear until the 1950's.
Resnikoff and Wells (1973) catalog examples and explicate the rules in their
delightful mathematics text. The new arena for failure here appears to be
bridges remarkable for their length or age. Up until a few years ago, for
instance, the Tampa Bay's Sunshine Skyway had "ample clearance for even the
largest ocean-going vessel." The fact that the tallest supports did not
collapse when many of the rest did is curious but no help at all to anyone
who wants to cross the bay by auto. Winds up to 100 mph swept away a sizeable
chunk of the new Hood Canal pontoon bridge in Washington state, and a
sizeable fraction of the $30 million investment with it (Los Angeles Times,
February 14, 1979, pp. 1, 8). Maintenance failures and deterioration may
hasten the demise of New York's Queensborough (59th Street), the Golden
Gate, and others used as bad examples in Congressional testimony on the
1978 Highway Act.

If we examine the start-up of businesses, we find prospects for failure
no less formidable. For 1978, the ratio of business failures to start-ups
was about 20%. This estimate is a conservative one since Dun & Bradstreet,
the repository for such information, defines business failure as a voluntary
action involving loss to creditors or court proceedings, that is, bankruptcy.
D & B maintains "that every year several hundred thousand firms are started and
almost an equal number are discontinued" (Dun & Bradstreet, The Business
Failure Record, 1979, p. 3). But commercial enterprise is certainly better
off now than during the late 19th century. The commercial death rate, as
it was labelled at the time, was double the number of new businesses added.
The ratio of failures, defined in terms of liability, to start-ups ranged
from [illegible] to 90%, or so said Bradstreet's (Stevens, 1890/1891).

The point is that, contrary to the opinion one may develop based on
anecdotal reports, narrow personal experience, or poorly designed evaluations,
innovations in a variety of areas succeed less than half the time,
and probably a good deal less than a third succeed at the field test stage.
Innovative educational programs may succeed at roughly similar rates when
properly evaluated.

Despite the occasional appearance of big bang effects, advances in
any science are usually small. This makes designing evaluations which are
sensitive to small effects very important. Given a design which provides
some protection against competing explanations, anticipating the likelihood
that program effects will be detected, if they occur at all, is
reasonable. But it is difficult to find formal power analyses in educational
evaluations, making it difficult to determine if the design was
indeed sensitive. It is small comfort that the same problem, ignoring
a fundamental technology, affects medical research (Freiman, Chalmers,
Smith, and Kuebler, 1978) and less recent research in psychology (Cohen,
1962). That the technology, even where occasionally exploited, is often
based on optimistic rather than realistic guesses about program effect
size is even less comforting (see Daniel, 1972, for instance, on industrial
experimentation).
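
The kind of formal power analysis referred to above can be sketched in a few lines. The example below is only an illustration under stated assumptions: a two-group comparison of means, a two-sided test at the 5% level, and a normal approximation; the function name approx_power and the effect sizes and sample sizes are hypothetical and are not taken from any of the studies cited.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    effect_size is the standardized mean difference (difference / SD).
    The normal approximation is used, so results are slightly optimistic
    when groups are small.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)            # critical value of the test
    shift = effect_size * sqrt(n_per_group / 2)  # expected test statistic
    # Probability that the statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative (hypothetical) effect sizes and group sizes.
for d in (0.2, 0.5):
    for n in (25, 64, 400):
        print(f"effect size {d}, n per group {n}: power ~ {approx_power(d, n):.2f}")
```

Under these assumptions a design that is adequate for a moderate effect can be nearly blind to a small one, which is the practical meaning of basing the calculation on optimistic rather than realistic guesses about effect size.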

4. Reliability and Validity of the Data
Mark Twain, according to Mark Twain, was not terribly bright. But
he did have the wit to assay reliability and validity of phrenologists'
readings of his skull and palmists' readings of his paw. Some readings
were wildly unreliable: bumps interpreted one month disappeared entirely
or became dents on the second engagement. The most reliable palmist
appears to have averred repeatedly that Twain had sense of humor
(Clemens, 1917/195?). Some researchers exhibit a sturdier indifference
to common sense.

Projects without much concern for quality of information include the
Philadelphia Federal Reserve Bank's evaluation of the Philadelphia School
District, the Federal Aviation Administration's evaluation of the Concorde's
impact on communities in the airport's vicinity, and many of the recent
studies of the impact of desegregation. For these and other cases, establishing
the quality of a response measure is essential for obtaining a
decent description of the nature of a social problem and for estimates of
the effects of a program on the problem.

Especially when evaluations are used to inform policy, the consequences
of ignoring flaws in the information can be serious. In covariance
analysis of observational data, for instance, simple random errors of
measurement can bias estimates of program effect. Under conditions commonly
found in the field, the result is to make weak programs look harmful in
compensatory education (Campbell & Boruch, 1975) and manpower training
(Borus, 1979), and to adulterate evidence about discrimination
in court cases. Similarly erroneous conclusions may be drawn in applications
of the same method to basic research data on schizophrenia, for instance
(Woodward & Goldstein, 1977). The difficulties abide for anthropological
disciplines as well as their more numerical sisters. Recall, for instance,
Lienhardt's view that Darwin was misled into believing Tierra del Fuego
natives were cannibalistic by natives who wished to be entertaining and
cordial. (See Przeworski and Teune, 1970, for illustrations and a
bibliography.)
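
The covariance analysis bias described above can be illustrated with a small simulation. This is a sketch under assumed conditions, not a reconstruction of any cited analysis: the program has no true effect, the program group starts one standard deviation lower in true ability than the comparison group, and the pretest is the true score plus random error. The function name ancova_effect and all numerical values are hypothetical.

```python
import random
from statistics import mean


def ancova_effect(n_per_group, true_effect, error_sd, seed=0):
    """Covariance-adjusted program effect when the pretest is measured with error.

    Group 1 (program) has lower true ability than group 0 (comparison),
    as in a compensatory program. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    groups = {}
    for z, ability_mean in ((0, 0.0), (1, -1.0)):
        xs, ys = [], []
        for _ in range(n_per_group):
            ability = rng.gauss(ability_mean, 1.0)
            xs.append(ability + rng.gauss(0.0, error_sd))               # fallible pretest
            ys.append(ability + true_effect * z + rng.gauss(0.0, 0.5))  # posttest
        groups[z] = (xs, ys)

    # Pooled within-group regression slope of posttest on pretest.
    sxy = sxx = 0.0
    for xs, ys in groups.values():
        mx, my = mean(xs), mean(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx

    (x0, y0), (x1, y1) = groups[0], groups[1]
    # Adjusted difference: posttest gap minus slope times pretest gap.
    return (mean(y1) - mean(y0)) - slope * (mean(x1) - mean(x0))


print("perfectly reliable pretest:", round(ancova_effect(5000, 0.0, 0.0), 2))
print("pretest reliability near 0.5:", round(ancova_effect(5000, 0.0, 1.0), 2))
```

With a perfectly reliable pretest the adjusted estimate sits near zero, as it should. With random error in the pretest the slope is attenuated, the adjustment removes too little of the initial gap, and a program with no effect at all appears to do harm.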

It is not difficult to find analogs to simple problems of reliability
of measurement in medical diagnoses. During the 1960's, for instance,
well-informed physicians knew that simple tests for gonorrhea yielded
false positives. One physician, not so well informed, managed to start an
outbreak of mass psychogenic illness (contagious hysteria) among high
school students by simply failing to read medical literature. Understanding
the traps in simple tests led Mausner and Gezon (1967) to avoid relying
on vaginal smears alone and ultimately to their development of a
remarkable case study of the episode. Measurement error in the response
variable appears now in more complicated ways, judging from the University
Group Diabetes Program. There, not a little of the ambiguity in evidence
is attributable to the way diagnosis of cardiovascular disease depends on
whether one conducts an autopsy. And, of course, the random instability
in blood pressure, among other traits, causes no end of argument about who
is hypertensive and who is not, and about whether labile hypertension as
indexed by blood pressure is merely regression to the mean or a similar
artifact of the way we measure or respond to measurement over time (Kolata,
1979). The problem is a hoary one in medicine and well documented at least
for illnesses such as smallpox and measles. Still, it is a bit unnerving
to stumble over examples: Citizen Graunt inveighed against the "ignorant
and careless searchers" who did not accurately enumerate deaths in the 1600's.
His little catalog of ways that cause of death might be misconstrued (does
a seventy-five year old man die of "the cough" or of old age?) is a rudimentary
theory of misclassification (Graunt, 1662/1973).
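
The regression-to-the-mean artifact mentioned above in connection with blood pressure can be sketched as follows. The simulation is illustrative only: the population mean, the spread of true pressures, the size of the random reading error, and the screening cutoff are all assumed values, and the function name screen_and_remeasure is hypothetical. No one in the simulation is treated.

```python
import random
from statistics import mean


def screen_and_remeasure(n_people=20000, cutoff=140.0, seed=1):
    """Select people on a high first reading, then measure them again.

    Each reading is a stable true pressure plus independent random error.
    All numerical values are illustrative assumptions.
    """
    rng = random.Random(seed)
    first_high, second_high = [], []
    for _ in range(n_people):
        true_bp = rng.gauss(120.0, 10.0)          # stable component
        first = true_bp + rng.gauss(0.0, 10.0)    # screening reading
        second = true_bp + rng.gauss(0.0, 10.0)   # later reading, no treatment
        if first >= cutoff:
            first_high.append(first)
            second_high.append(second)
    return mean(first_high), mean(second_high), len(first_high)


first_mean, second_mean, n_selected = screen_and_remeasure()
print(f"{n_selected} people selected on a first reading of 140 or more")
print(f"mean first reading:  {first_mean:.1f}")
print(f"mean second reading: {second_mean:.1f}")
```

Those selected for a high first reading show a lower average on the second reading even though nothing was done to them, which is why an apparent improvement among people chosen on an extreme, unreliable measure is hard to interpret.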

Just as judgment about children may be influenced by teachers' expectations,
medical assessments are sometimes slanted by physicians'
expectations more than by evidence. Recall that in McEvedy and Beard's
(1973) study of neuromyasthenia, symptoms similar to those exhibited by
victims of poliomyelitis were also exhibited by physicians and nurses
without the latter disease. The physicians regarded neuromyasthenia as a
clinical syndrome when indeed the problem was psychogenic. Barnes' (1977)
fascinating review of worthless surgery is also pertinent here. He
reminds us that ptosis was characterized early in the 20th century as a
condition in which the position of internal organs was "abnormal." Surgeons
thought the abnormality caused a wide variety of symptoms. Kidney
displacement, for instance, was alleged to produce neuroticism, back pain,
and vomiting. We know now that ptosis is not an organic problem, that surgery
was unwarranted, and that diagnosis and etiology were nonsense. The
reader may think this illustration far fetched. It is not, judging from
recent efforts to slice the incidence of tonsillectomies, hysterectomies,
and adenoidectomies (see Dyck et al., 1977, for instance).

The engineer has to accommodate problems of error in measurement
too, of course. And despite the awesome growth of the instrumentation
industry, they are often no less severe. For instance, in the Handbook
of Dangerous Properties of Industrial Materials, Herrick (1979) reports
that the reliability of air screening is such that readings are within
±25% accuracy. This suggests that reports of environmental tests should
routinely provide information about their reliability, just as one ought
to provide estimates of reliability of personality inventories, questionnaires,
and the like. The validity of environmental test results depends
no doubt on local circumstances. And it's conceivable that the results
ought to be adjusted for these, just as are standard measures that are
influenced by temperature and buoyancy in the case of weight. The difficulty
of adjusting for temperature can be traced to Michelson's
efforts to correct for thermal expansion in estimating the speed of light
and his failure to correct for temperature influences on the index of
light refraction (Eisenhart, 1968).

Flaws in observation and measurement on a much larger scale are not
advertised much, especially if they concern the military. But remarkable
ones surface occasionally. Detecting an atomic blast, for example, is
not as easy nor as reliable a process as one might expect. The Vela surveillance
satellite "saw" an explosion in 1980. What was thought, until
then, to be a unique signal associated with a blast turns out not to be
unique. It is produced by a peculiar confluence of natural phenomena as
well. The influences on accuracy of measurement are as difficult to
assay in military engineering. Part of the controversy over the Airborne
Warning and Control System hinged on the airplane's susceptibility to
attack and the system's lack of robustness against state-of-the-art
devices for signal jamming (Sousa, 1979). And, of course, the interest
and skill of individuals given responsibility for measurement plays a
major role. The federal delegation of authority to state governments in
the national dam safety program, for instance, resulted in data which
varied enormously in quality. Dams were missed entirely in their inventory,
hazards were ignored, and data were inaccurate in other respects (Perry, 1979).

Recognition of such problems in the sciences is not recent. Galileo
had the sense to have the ball descend the channel repeatedly to assure
that his estimates of acceleration rate were decent. Not more than 30
years later, Graunt (1662/1973) issued complaints about the indifferent
quality of records available for political arithmetic. Over a hundred
years later, astronomer Simpson made the same point in writing about the
need to obtain a mean in observations. Echoes of that advice can be
detected in at least one electrical engineering text of 1917 and one chemical
engineering text of 1938. (See Eisenhart, 1968, for a remarkable
treatment of the topic and for references to these examples.) As one might
expect, there are physical antecedents to contemporary debates over definitions
of intelligence, ability, and the like. The difference between
the American inch and the British inch, created by legal fiat in 1866, was
small but caused no end of problems until 1966 when both were defined by
agreement as 2.54 cm (Barry, 1978).

The idea that there are important qualitative aspects to the problem
of measurement error is not especially new. A founding father of statistical
quality control methods recognized it in the 1930's, stressing that
people, the physical devices, and other influences on measurement need to
be recognized. His observations were presaged by astronomer George Biddell
Airy in 1861, who warned against "light" assumptions about presence or
absence of constant error, and urged recognition of chance variation. The
statistician Gosset (aka Student) recognized higher consistency among
measures taken within a day relative to those across days, and speculated
on the reasons for the phenomenon in 1927 (Eisenhart, 1968). Mosteller
(1978) notices similar structure in time lapse data generated during the
1860's under the support of the U.S. Coast and Geodetic Survey. Working
on Gosset's turf in 1956, Cunliffe (1976) found notable random variation
and peculiar within-laboratory variation in measures of the volume of
Guinness beer in bottles. This was apparently remarkable enough to justify
"very delicate conversation" between Cunliffe and Guinness's chemist, from
which each "retired, somewhat wounded."
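
The structure Gosset noticed, closer agreement among measures taken within a day than among measures taken on different days, can be expressed as a split of total variation into a within-day and a between-day component. The sketch below uses a standard one-way analysis of variance on simulated data; it is not Gosset's or Cunliffe's own calculation, and the function names, the number of days, and the standard deviations are assumptions made for illustration.

```python
import random
from statistics import mean


def variance_components(days):
    """Estimate within-day and between-day variance from balanced data.

    days is a list of equal-length lists, one list of measurements per day.
    """
    k = len(days[0])
    day_means = [mean(d) for d in days]
    grand = mean(day_means)
    ms_within = sum((x - m) ** 2 for d, m in zip(days, day_means) for x in d)
    ms_within /= len(days) * (k - 1)
    ms_between = k * sum((m - grand) ** 2 for m in day_means) / (len(days) - 1)
    # Method-of-moments estimate, truncated at zero.
    var_between = max(0.0, (ms_between - ms_within) / k)
    return ms_within, var_between


def simulate(n_days=400, per_day=5, day_sd=2.0, within_sd=1.0, seed=2):
    """Hypothetical measurements with a shared day-to-day shift."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        day_shift = rng.gauss(0.0, day_sd)
        days.append([100.0 + day_shift + rng.gauss(0.0, within_sd)
                     for _ in range(per_day)])
    return days


within, between = variance_components(simulate())
print(f"within-day variance:  {within:.2f}")
print(f"between-day variance: {between:.2f}")
print(f"share of variance due to days: {between / (between + within):.2f}")
```

When the between-day component is large, averaging many readings taken on a single day gives a misleading impression of precision, since the day's own shift is never averaged away.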

The little herd of theories and inventions which helped to improve
understanding of the qualitative aspects of measurement in physical and
engineering sciences seems not to have been matched in the social sector.
But some relevant work has been done. In broadening his thesis on social
experimentation, for example, Campbell (1975) espoused a side theory on
corruption of social indicators. The idea is that as soon as it becomes
well known that a measure is being used in making policy decisions, notably
in program evaluations, the measure will be corrupted in some degree. A
related idea characterizes 14th century India's Ibn Kaldun's observations
on his predecessors' exaggeration of numbers in description. Numeric sensationalism
exalted the status of historian and statesman then as it does
now, and Kaldun's attributing the problem to lack of conscientious criticism
seems no less pertinent now. During the same period, China regarded
the problem of suppression of facts in censuses as serious enough to justify
beheading minor officials (Jaffe, 1947). To get much beyond the
idea, one must identify the main influences on corruption. For Knightley
(1975), in what must stand as a model of crude theory in war reporting,
this meant tracing the quality of battle statistics, from the Crimean
wars to Viet Nam, as a function of incompetent journalists, self-interested
generals, self-serving politicians, and as a function of what he regards
as a minority, the virtuous members of each camp. Sound misreporting in
recent wars seems not to have impeded military careers of some generals
(Halberstam, 1969).

The scholars' observations on corruption are clever and important.
But it does seem sensible to recognize other persistent sources of
distortion. Indifference and inability may not be as titillating as
corruption but they are likely to account for more of the problem. The
indifference was recognized by Graunt if we interpret correctly his
concerns about London's ignorant and careless searchers. It is implicit in
Barnas Sears's reservations, as Secretary of Education for Massachusetts,
about the quality of educational statistics, in 1850: "Those who know the
summary manner in which committees often arrived at their conclusions in
respect to this (numbers of children in various types of schools), will
use some degree of caution in reasoning from such data" (Kaestle &
Vinovskis, 1980). Inability is harder to infer. But it's not an
implausible reason for distortion in Chinese censuses of the 14th century
and afterwards: the individual being counted might regard the act as
depleting one's spirit, it's something of an embarrassment to have an
unmarried, marriageable daughter in the household, and so on (Jaffe, 1947).
And it accounts, at least partly, for poor statistics on some diseases:
17th century attitudes toward venereal disease and its recognition appear
to have been almost phobic, and probably helped to enrich the physicians of
the period.

5. Access to Data and Reanalysis
Routine reanalysis of data from program evaluations is a relatively
new phenomenon. But the general notion of secondary analysis of social
statistics is not. In the United States at least, it was implicit in
Madison's arguments with Congress about the multiple uses of census
information (Cassedy, 1969). It was dramatically explicit in arguments
over social statistics just before the Civil War. Congressional criticism
of printing contracts for the 1840 census result and John Quincy Adams's
interest in census inaccuracies led to the American Statistical
Association's investigating the data (Davis, 1972; Regan, 1978). There was
considerable controversy since the statistics were used by slavery
advocates such as John Calhoun to support the "peculiar" institution. The
spirit of the enterprise in the laboratory has been durable. It is
reflected, for instance, in reanalysis published in 1929, of
psychophysical data generated in 1873 by C.S. Peirce. Mosteller (1978), who
provides the references, rummages still further, in the interest of
illustrating the character of nonsampling error.

In recent years, good illustrations have stemmed from evaluative research
on social programs. These include fascinating reanalyses of Coleman's
Equality of Educational Opportunity Surveys, appearing in a volume edited
by, of all things, a senator and a statistician (Moynihan & Mosteller,
1972), of data from evaluations of Sesame Street (Cook et al., 1975), Head
Start (Magidson, 1977), and others. At times, the results are both
surprising and important. Leimer and Lesnoy (1980), for instance, appear to
have discovered a fundamental error in the 1974 work by Martin Feldstein,
current president of the National Bureau of Economic Research. The
original work, used as a basis for policy, purported to show that social
security had a large negative effect on individuals' savings. The
reanalyses show no such effect and imply remarkably different policy. In
Fowler vs North, the Supreme Court used an economist's estimates of the
effect of capital punishment on homicide rate in reaching its decision on
constitutionality of that punishment. At least one major reanalysis, done
after the decision, suggests that contrary to earlier conclusions, capital
punishment does not have a substantial deterrent effect (Bowers & Pierce,
1980).

Similarly remarkable changes in views come about occasionally in
reanalysis of physical data. For instance, a health physics laboratory
recently analysed rainfall samples following a suspected nuclear explosion.
Their findings on water borne fission products appeared to confirm the
fact that the blast occurred, but independent tests suggested no such
thing (Science, 1980, 207, 504). The original finding appears to have
been due to contaminated instruments. In the case of the University Group
Diabetes Program, a federal decision to require warnings on the use of
tolbutamide was made before the research was reported in professional
forums and much before data was to have been released for secondary
analysis (Kolata, 1979b). The requirement was eventually rescinded when
arguments over implications of the data became serious.

The reanalysis of evaluative data carries no guarantee that it will
inform any more than does reanalysis of other kinds of data. Nor
will it always be apparent that reanalysis will be more informative than
primary analysis. Ingenuous optimism about the latter appeared among some
turn of the century professors and I see no reason to ignore that history
and its implication. In particular, it is an unwarranted expectation that
"as a multiplication table should be reliable for both the Tory and the
Communist, the conclusion of social trends should be valid alike for the
radical and conservative" (Odum, quoted in Barnes, 1979, p. 62). The data
will, for example, be used for purposes other than those for which they
were collected, properly and improperly. Chambers (1965), for instance,
recounts how the correspondence between time series data on smallpox
incidence and on vaccination campaigns was interpreted by
antivaccinationists as a demonstration of the invidious effect of
vaccination when, in fact, the campaigns were mounted following the onset
of an epidemic. Barnes (1979) reminds us that Marx used data from Her
Majesty's inspection of factories in ways "undreamed of" by the
government. Debate about what the data mean can be extended. The UGDP
trials ended in 1968, but papers which purport to find the vitiating flaw
in original interpretation continue to appear (Kilo, Miller, Williamson,
1980). Fifteen years after randomized field tests of cloud seeding in the
United States, arguments about what the conclusions ought to be persist
(Braham, 1979; Neyman, 1979). Durable debates are no less easy to find
in educational program evaluation though they seem to be less grim and
certainly less vituperative than those in the medical arena. Magidson
(1977) builds some plausible models for estimating that program's effect in
^5c6 or so. The models seem not to have satisfied other scholars publishing
in Evaluation Quarterly since then.

It has not always been easy to secure data for secondary analysis in
any of the sciences. Proprietary interests, declared or not, seem to account
for data not being manifestly available to independent analysts when the
North decision on capital punishment was reached. Indeed, the first major
criticism of the analyses used in the case was based on conscientious
reconstruction of the data, from disparate sources checked to assure that
the data were similar to if not identical to the information in original
analyses (see Bowers and Pierce, 1980). The territorial imperative and
•WiDnai differences among scientists appear in over half the chapters of
Watson's Double Helix as obstacles to fitting better models of DNA structure
to raw data, X-ray diffraction photographs. The problem of access in the
natural sciences is sufficiently tantalizing to warrant the attention of
independent authors as well. Koestler's (1971) description of the tangle
over Kammerer's research emphasizes the difficulty of acquiring the toads
he used, or parts thereof, to verify the scientist's claims.

When the data are held by an institution, matters become very difficult
indeed and may involve the courts. The problem is less one of discipline
difference than contest between the governments stg^Mlr and civilian,
"Sharing information does not come naturally to the policy maker because
knowledge is power," or so sayeth Yarmolinsky (1976, p. 265). Threats of
legal suits under the Freedom of Information Act have been used to extract
social data from ADAMHA, just as they have been used by physical scientists
to obtain information from the Atomic Energy Commission on licensing
criteria and from the Federal Aviation Administration on the Supersonic
Transport (Primack & von Hippel, 1974). The Department of Defense's refusal
to disclose actual sites of herbicide spray in Viet Nam impeded the attempts
of the American Association for the Advancement of Science to verify the
Department's claims that effects of spraying are negligible and to assess
the laboratory data on the topic (Primack & von Hippel, 1974). In
Forsham vs Harris the access issue commanded the attention of the Supreme
Court. There, the suit brought by independent analysts argued that data
generated in the University Group Diabetes Program trials should be made
available for reanalysis. A period of groping among federal agencies to
determine which one had the data was followed by a legal suit. Apart from
the general scientific justification for access, it was argued that the
data from a publicly supported project were used as a basis for major policy
decision and this implied that the information ought to be made available for
reanalysis. The Court ruled against forced disclosure.

Institutional reluctance to disclose information is not new, of course.
But it may come as a surprise that paragons of early statistical virtue,
such as John Graunt, were not disposed to free access. In his introduction
to Natural and Political Observations (1662/1973), Graunt advocated England's
keeping records universally on burials, christenings, and an assortment of
other events. But he adds ". . . why the same (statistics) should be made
known to the people, otherwise then to please their curiosity, I see not"
(p. 12). At the end of the monograph, he passes the buck: "But whether the
knowledge be necessary to many or fit for others, then the Sovereign, and
his chief Ministers, I leave to consideration" (p. 74), presumably of these
same authorities. Graunt's unwillingness, or at least ambivalence, to
disclose information was not unusual. De Santillana (1955) reminds us of the
"Pythagorean privacy of research" that characterized views of Copernicus
and Galileo. Neither they nor their contemporaries were much inclined to
publicize some of their observations and the constraints of religion seem
to have been only part of the problem. Lecuyer and Oberschall's (1968)
fascinating review of the history of social research in western Europe
suggests swings between openness implied by government ordinances requiring
registry publication of births, deaths, and so on in the 17th century, and
the secrecy implied by surveys and reporting systems for taxation and
military conscription of the 18th century. Nor does this seem to be a
European phenomenon. The secrecy that characterized storage of demographic
data collected in 17th century Dahomey and in China in apparently all
censuses is military in its origins. For Dahomey, this was probably less
easy to do than it sounds: counts were represented by large sacks of
pebbles and updated often.

In social statistics generally, there have been recent efforts to make
information more readily available. Flaherty (l?7f), for instance, took a
leadership role in getting international agreement on principles of
disclosure, principles which run counter to the conservative tradition of
statistical bureaus in Britain and Germany among others. In the United
States, there have been more than a few very recent efforts to assure that
evaluation data are more readily available for review. Federal, rather than
state, agencies in criminal justice research, education, and census have
developed policy and are testing it (Boruch, Wortman, Cordray, 1980). The
same spirit is evident in recent advice to medical researchers that
experimental data be stored for secondary analysis by independent analysts
(Mosteller, Gilbert, & McPeek, 1980), and in the creation of data
repositories for studies in meteorology (Braham, 1979), energy,
environment, and others (Kruzas & Sullivan, 1978).



6. Individual Privacy and Confidentiality

History

Despite contemporary rhetoric, the privacy questions that emerge in
social research efforts are not new. We can trace public concern about
census surveys to 1500 B.C., when in Exodus (30:11-17) and Samuel (2 Sam.
24:1-5), we find both God and man opposing military demography. Popular
objections are rooted at least as much in a wish for intellective privacy
as in a desire for physical self-preservation, and they are no less
evident in the early history of social research in the United States. An
interest in sustaining at least some anonymity with respect to the
government reveals itself in colonial New England's restricting the
collection of data for "public arithmetic" to publicly accessible
information (see Cassedy 1969 and Flaherty 1972). The privacy theme is
implicit in Madison's arguments with Congress over what data should be
collected in national censuses for the sake of managing the republic. It is
explicit in Lemuel Shattuck's reports to the Massachusetts Sanitary
Commission in 1879, which refer to public concern about the propriety of
the then-novel epidemiological survey and about the government's use of the
resulting data. Shattuck's work foreshadowed controversy over routine
surveys of public health and the creation of archives containing
information on mortality and health during the late 1800s (Duffy, 1974).
That controversy is no less apparent today in the developing countries,
where, for example, deaths may go unreported on account of privacy custom,
memory lapse, or inheritance taxes. The collection of economic data has run
a similarly difficult course, public demonstration against the Social
Security Administration's record keeping during the 1930's reflecting a
concern not only about personal privacy but, from commercial quarters,
also about institutional privacy.

That data obtained for statistical research ought to be maintained
as confidential is probably at least as old an idea. But aside from the
fine work of Flaherty (1972) and Davis (1971, 1972), there is scant
historical documentation on the matter. In America at least, the idea is
explicit in guidelines issued in 1840 by the Census Bureau, requiring
that census enumerators regard as confidential information obtained from
their respondents (Eckler, 1972). Indeed, the history of attempts to make
certain that the respondent's fear of disclosure would not inhibit
cooperation in social research can be traced throughout much of the U.S.
Census Bureau's existence. As the amount of information elicited grew
from the simple enumeration of 1790 to the economic and social censuses
of the early 1900's, and as the quality of surveys shifted from the
astonishingly inept efforts before 1840 to the remarkably high-caliber work
of the present day, so too did the laws governing disclosure: from rules
demanding public posting of information elicited in a census to explicit
statutory requirements that information on individuals remain completely
confidential in the interest of preserving the quality of data available
to the nation. The same theme is evident in the early development of
economic welfare statistics, notably under the Social Security
Administration. The problem of deductive disclosure is not a new one either.
Ross Eckler's (1972) history suggests that risks of accidental disclosure
based on published statistical tables, most evident in the census of
manufacturers, were officially recognized as early as 1910.

Legislative protection has, in the case of the census, been helpful
in resisting pressures brought to bear on this public interest by other
public interests. The U.S. Census Bureau has successfully staved off
demands for information on identified respondents that range from the
trivial to the ignominious. The latter include attempts to appropriate
census records during World War II in an effort to speed up Japanese
internment. There have been requests that were superficially worthy,
including location of lost relatives, and others that were not so worthy.
But the same level of protection in one quarter may serve as a barrier
in another. Under current rules, one may not access census records that
are under seventy-two years old for sociomedical or psychological research,
or any other type of social research. The absence of such rules evidently
facilitated Alexander Graham Bell's original genealogical research on
deafness, based on records available from the 1790 census onwards (Bruce,
1975).

What is new then is not the occurrence of privacy concerns in social
research, but rather their incidence and character. Social scientists,
including those who have been traditionally uninterested in field research,
have become more involved in identifying social problems and testing
possible solutions through field studies. This increase in the policy
relevance of research generates conflict with some policy makers simply
because a new standard, higher-quality empirical data, is being offered as
a substitute for a more traditional emphasis on anecdote and expert opinion.
The increased contact between social scientists and individuals who are
unfamiliar with their methods, objectives, and standards is almost certainly
a cause of increased discord, including argument about privacy. Finally,
the larger research efforts typically involve a variety of interest groups
and commentators. The interaction of research sponsors, auditors,
journalists, and groups of research participants with opposing views on the
value and implications of the research complicates matters. In this setting,
privacy arguments may distract attention from far more important issues;
they may be entirely specious simply because reporting is inaccurate, or
they may be legitimate but intractable because the standards of interest
groups differ remarkably.

Corruption of the Principle

It does not take much imagination to expect that, at times, a
confidentiality principle will be used honorifically. In the best of these
instances, the appeal to principle is pious but irrelevant; that is, there
is no real threat to individual privacy or to confidentiality of records.
At worst, the appeal is corruptive, dedicated not to preserving individual
privacy but to assuring secrecy that runs counter to the public interest.
In either case, social research and especially the evaluation of social
reforms are likely to be impeded. Lobenthal (1974), for example, reports
that in designing evaluative research on correctional facilities:
Even many [correctional] program personnel from whom we sought
information rather than advice withheld their cooperation. There
was, for example, a sudden solicitude about clients' rights to
privacy and ostensible concern with the confidentiality of records.
When an elaborate protocol was worked out to safeguard
confidentiality, the data requested were still not forthcoming (p. 32).

Similarly, the privacy issue has been used to prevent legitimate
evaluations of some drug treatment programs in Pennsylvania, where records
were destroyed despite immunity of record identifiers from subpoena under
the 1970 Drug Abuse Act. It has been used to prevent evaluation of manpower
training programs in Pittsburgh and evaluation of mental health services
programs in southern California. It has been used to argue against the
System Development Corporation's studies of integration programs in the
New York City school system, despite the fact that children who responded
to inquiries would be anonymous. These episodes do not represent the
norm, of course. They do represent a persistent minority event.

Little vignettes at the national level are no less noteworthy, though
the reasons for impertinent appeals to privacy differ a bit from the ones
just described. For example, according to Boeckmann (1976), Senate
subcommittee members used the privacy issue as a vehicle for discrediting
researchers during hearings on the Negative Income Tax Experiment. She
suggests that the action was part of a drive to bury the idea of a
graduated income subsidy program. More generally, the privacy issue has been
a convenient vehicle for assaulting national research that could threaten
political interests, and for getting votes. Mueller (1976), for example,
argues that former President Nixon's support of the Domestic Council on
Privacy, the Privacy Act, and theories of executive privilege did what
it was supposed to do: focus public attention on matters other than war.
That both uses are persistent but low-frequency events is evident from
similar experiences in Norway, Sweden, and Denmark, as well as in the
United States (Boruch & Cecil, 1979).

The most predictable adulteration of principle occurs before each
U.S. population census, when ritualistic assault competes with thoughtful
criticism for public attention. To Charles Wilson, a former chairman
of the House Subcommittee on Census and Statistics, for example, much
of the controversy over the 1970 Census was deliberately fomented by
colleagues interested less in privacy than in votes, and by journalists
moved less by the need for balanced reporting than by the need to generate
provocative stories. Further, the evidence used in attacks on the census
was often misleading.

Reference was continually made to a total of 117 [Census] questions
despite the fact that this total could be obtained only by adding
all the different inquiries on the forms designed for 80% of the
population, those for 15%, and those for 5%. A number of the
questions appeared on one form only, and the maximum number of
questions for any individual was actually less than 90. The
question on whether the bathroom was shared continued to be
distorted into the much more interesting version "With whom do you
share your shower?" (Eckler, 1972, p. 202)

Similarly, in House Subcommittee Hearings, "One witness who had been
scheduled to appear in support of legislation, proposed by Congressman
Betts to restrict the 1970 Census, admitted that he had learned from
earlier witnesses that his prepared statement was incorrect" (Eckler, 1972,
p. 204).

An agency's refusal to disclose data on even anonymous individuals,
under false colors of privacy, is of course not a new problem, nor is it
confined to the social science arena. Its origins, in the United States
at least, date from the reluctance of the Massachusetts Bay Colony to
disclose either statistical information on mortality rates or records on
the death of identified individuals, for fear of jeopardizing their
project (Cassedy, 1969). The data, if disclosed, would presumably have
made the colony much less attractive a prospect for volunteer colonists
and for its conscientious sponsors. A similar reluctance appears to
underlie the distortion of fatality and accident rates published by
commercial contractors for the Alaska pipeline (see the New York Times,
7 August 1975). Institutional self-protection of the same type has
hampered the efforts of biomedical researchers to understand the causes
of the Thalidomide tragedy: the pharmaceutical company has refused to
disclose its data on test subjects in statistical summary form or otherwise.
The idea is implicit in the refusal of the Philadelphia public school
system, during 1975-76, to disclose data on minority groups to the U.S.
Office of Civil Rights on the grounds of student privacy, though OCR
required only statistical summary data. It is transparent in at least
one court case involving a school's efforts to resist, on Privacy Act
grounds, the sampling of anonymous students by researchers who were
interested in the racial biases that may underlie diagnosis of maladjusted
and emotionally disturbed youths [Privacy Journal, 1977; Lora v. Board
of Education of City of New York (74 F.R.D. 565)].

There are, at times, good administrative and political reasons for
an agency's refusal to disclose statistical records to a researcher or to
permit researcher access to individuals. Though we may be unable to
subscribe to those reasons, it is not in our interest to confuse the
reasons for refusing disclosure with the issue of individual privacy. It
is reasonable to anticipate that controversy will be instigated for
purposes other than those advertised, even if we can offer no general advice
here on preventing dispute. And we can offer partial solutions to one
problem.

7. Public Interests and the Quality
of Evidence in Public Policy

Reasoning from information is often not easy. And if the information
is of an unfamiliar sort, as statistical data are for many, the task is more
difficult. Perhaps more important, the unfamiliarity makes it difficult to
persuade others that the information can indeed be useful and ought to be
valued at least as much as experience and anecdote.

As one might suspect, the problem is an old one. No formal history
of public interest in evidence for policy purposes has been written. But
it should come as no surprise that arguments about the matter are as old
as recorded efforts to consolidate for public policy. Consider, for
instance, John Graunt's (1662/1973) Natural and Political Observations on
Bills of Mortality, a progenitor of modern tracts on policy statistics.
The conclusionary chapter poses a question:

"It may be now asked to what purpose tends all this laborious
bustling and groping? To know the number of . . . people,
fighting men, teeming women, what years are fruitful, what
proportions neglect the Orders . . ." and so forth (p. 71).

Graunt makes no bones in his first response: 

"To this I might answer in general by saying that those who 
cannot apprehend the reason of these enquiries are unfit to 
trouble themselves to ask them" (pp. 71-72). 

His second reason places him among many contemporary statisticians:

". . . it is much pleasure in deducing so many abstract and 
unexpected inferences." 

And his third is more politic:

". . . the foundation of this honest and harmless policy is
to understand the land and the hands of the territory to be
governed according to all their intrinsic and accidental
differences. . . by the knowledge whereof trade and government
may be made more certain and regular. . . so as trade might
not be hoped for where it is impossible. . . (all) necessary
to good, certain, and easy government. . ." (pp. 73-74).



Graunt's later remarks make it clear that he thinks it is in
government's interest to pay attention to statistics. But he is not at all
convinced that there's any reason for disclosing the data to the general
public (see the section on Access).

Despite such early efforts, the history of "evaluation," as a formal
and sustained interest of government, is embarrassingly brief. The problems
engendered by collecting high-quality evidence of course are not. This has
some implications for the accuracy of our views of contemporary progress and
products of evaluation.

Progress is Slow 

In the United States, exploiting high-quality information about social
problems has been of episodic, rather than sustained, national interest,
and progress is more typically sluggish than not. For instance, it was
not until the 19th century that the country systematically confronted major
flaws in the decennial census, and it took longer to rectify them, despite
the periodic recognition of problems in Europe and elsewhere as early as
the 14th century. Naturally, rectification was stimulated by crisis. In the
1840 census, black residents of entire towns were enumerated as insane by
interviewers with more political zeal than integrity (Regan, 1973). The
remedial action, appointment of census directors and regular staff partly
on the basis of merit rather than on politics alone, helped. But another 80
years passed before the Census Bureau initiated a program of routine side
studies on the quality of census data.

Similarly, there were some interesting efforts by statisticians to assay
the effect of law or other social intervention on statistics after the Civil
War. Calkins (1890-91), for instance, published a careful article assaying
the effect of England's first major public health act on mortality, managing
to detect and correct computational errors in the process. Indeed, he copies
earlier work by Farr and does a little cost benefit analysis of the law
estimating the value of human life at $770 per head (1890 U.S. dollars of
course). Yet the practice of evaluation of any sort, much less cost-benefit
evaluation, does not appear to have been a routine requirement of
legislation for another 70 years. The earliest randomized field tests of
medical regimens were undertaken in the early 1930's. But well-designed
randomized field tests did not become common until the 1960's, and a fair
number of poorly designed evaluations continue to be carried out (Cochran,
1976). The proportion of such trials reported in medical journals has
increased from 9% in 1963 to 46% in 1978; of the remainder, most appear to
use no controls at all (Chalmers & Schroeder, 1979).

The execution of randomized field tests of nonmedical regimens has not
been especially routine either. Interest in experimental tests of social
services programs which appeared in the 1930's (notably in hygiene) failed
to continue, though it was rekindled in the 1970's (Riecken & Boruch, 1978).
Judging from Braham (1979), efforts to mount scientific tests of weather
modification methods can be traced to 1946, despite a long history of rituals
designed to produce rain. He suggests further that the first fifteen years
of such tests did not lead to much useful information about seeding but did
yield development of research tools useful in the tests. The same development
and lack of clear results characterized tests of educational programs during
the 1960's and 1970's as well.

This inconsistency is not peculiar to medicine or the social sciences.
The history of technology, for example, suggests that 25 bridges collapsed
each year following the Civil War, but high-quality tests and adherence to
structural standards did not become routine for another 60 years. The
current renewed interest, among engineers if not the public, in bridge
failures during the middle 1970's suggests that attention to quality control
30 years ago was rather too modest. The extent of interest in quality of
evidence in any human enterprise, and especially in social program
evaluation, is recognized only occasionally. This often engenders naive
opinion about the process and that, in turn, has some implications for
government posture.

Pockets of Interest

The occasional intensive efforts of citizens' groups to collect reliable
data bearing on social problems are traceable at least to the early 1800's.
Lecuyer and Oberschall (1968), for instance, attribute the appearance of
local statistical societies in England during the 1830's to the general
interest in social reform. The societies apparently organized private
applied social research of a quantitative sort to investigate health,
working conditions of the poor, and education. Interviewers were hired and
sent door to door. Similar groups appeared in post-revolution France and in
Germany during the middle 1800's. Lecuyer and Oberschall identify Paris's
efforts to abate prostitution as an illustration of an early municipal
evaluation. For the United States, Kaestle and Vinovskis (1980) suggest that
it is "no accident that the appearance of the first systematic school of
statistics coincides with the educational reforms of the late 1830's and
1840's. The data were a crucial tool for reformers in their public relations
efforts" (p. 10). The spirit of the enterprise reappeared in England in 1937
with the mass observation movement, propelled by the belief that anyone can
make systematic inquiries about social phenomena (Barnes, 1979, p. 51-52).

A similar spirit is reflected in contemporary private surveys of voluntary
organizations such as the League of Women Voters, Common Cause, and the
like. The more technical varieties include the Stanford Workshops on
Political and Social Issues (SWOPSI) in the physical and engineering
sciences, the various Committees of the Assembly of Behavioral Sciences of
the National Academy of Sciences, and others.

Ways of Knowing and Inept Evaluation Design 

At least part of the variability of general interest in evidence is
traceable to an embarrassment of riches. There are lots of ways of knowing,
of apprehending information, and lots more ways of reasoning from the
information.


Conflict between one stereotypical way of knowing and contemporary
scientific method is exemplified by the battery additive case in 1951-54.
In that instance, a chemical manufacturer claimed that one of his products
increased life of storage batteries significantly, despite the National
Bureau of Standards' tests on related compounds and the negative test
results. NBS was eventually asked to test the product, providing evidence
in charges of fraud brought against the manufacturer by two government
agencies. The NBS eventually reported that the additive had no detectable
effects despite claims by the manufacturer, testimonials from trucking
companies, and other manufacturers. The ensuing battle pitted small
business against regulatory power of government, and more important here,
evidence of a less formal sort against evidence obtained in more systematic
fashion.



The seriousness of the debate was reflected partly by the Secretary
of Commerce's asking for the resignation of NBS's director. The tone of
at least one side of the argument is reflected in the Secretary's
characterizing himself as "a practical man" rather than a man of evidence.
Practicality was also espoused by the Chairman of the Senate Committee on
Small Business who questioned the director: "The simple truth of the
question is that if a good hardfisted businessman has used the product in a
fleet of motors...and places orders month after month, what is the matter
with him? Or otherwise, what is the matter with the Bureau of Standards'
Test?" (p. 159). The director's understated response emphasized the
controlled conditions required for scientific inference and cited Sinclair
Lewis's Arrowsmith to illustrate. That appears to have been necessary but
not sufficient to the eventual NBS victory.

The more dramatic examples of inept evaluation design have occurred in
medicine, where medical or surgical remedies, adopted on the basis of very
weak evidence, have been found to be of no use at best and to be damaging
to the patient at worst. Case studies are not too difficult to find.

For instance, the so-called frozen stomach approach to surgical treatment
of duodenal ulcers was used by a variety of physicians who imitated the
technique of an expert surgeon. Later well-designed experimental tests
showed prognoses were good simply because the surgeon who invented the
technique was good at surgery and not because his innovation was effective.
It provided no benefit over conventional surgery (Ruffin et al., 1969).

Prior to 1970, anticoagulant drug treatment of stroke victims had
received considerable endorsement by physicians who relied solely on
personal, observational data for their opinions. Subsequent randomized
experimental tests showed not only that a class of such drugs had no
detectable positive effects but that they could be damaging to the patients'
health (see Hill et al., 1960, and other examples described in Rutstein,
1969).

There are localized examples too of course. Consider, for instance, a
recent Science (Vol. 107, 1980, p. 161) article on the use of snake venom.
A Florida physician claimed 20% cure rates of patients treated for multiple
sclerosis with venom. The physician received enough publicity to force the
FDA to give it attention. The Food and Drug Administration sponsored a
workshop to determine if the evidence justified the design and execution
of controlled clinical trials. The main conclusion seems to be that the
evidence is weak, and moreover that multiple sclerosis (one of the diseases
for which cures were claimed) "follows such an erratic path that it's
impossible to attribute improvements to any therapy without double blind
studies." The evidence for people was not sufficient to override tests
of other options.

There have been some recent efforts to characterize this problem
statistically. One such approach has been to illustrate how a declaration
that a program or regimen is successful depends on quality of the design.
Consider, for instance, Gordon and Morse's (1975) review of published
evaluations of social programs. Their appraisal suggests that the
probability of an evaluator winding up with a declaration that the program
was a "success" based on a poor design is twice the probability based on
good designs. Chalmers' (1972) analysis of a small sample of medical
investigations on estrogen therapy of prostate carcinoma suggests that
enthusiastic support of the therapy was almost guaranteed when the
experiment was poorly designed. Improvement takes time. And there has been
an improvement at least in the sense that better designs are being used
more frequently. For instance, Chalmers and Schroeder's (1979) estimates
of the proportion of experiments reported in the New England Journal of
Medicine suggest that there has been a five fold increase in the number of
studies employing randomized controls over a 25 year period to 1978. A
similar analysis of studies appearing in Gastroenterology suggests that
the fraction of excellent ones has increased from 5% to about 30% during
1953-1967. Similar problems are alleged to have affected the food industry.
According to Samuel Epstein (National Academy of Sciences, 1974, p. 221),
"In 1967, 50% of all petitions submitted to the FDA in support of food
additives were rejected...because of incomplete, inadequate, or
nonspecific data." (I have not been able to locate more recent estimates.)



Choice, Approximation, and Compromise

The need to choose between acquiring statistical information whose
character is well understood and obtaining information of a less formal
sort occurs often. In evaluative research, for instance, it emerges in
debates about whether to invest in randomized field tests rather than in
less expensive designs that yield more ambiguous information. It appears
in debates about whether to mount designed surveys or to settle for a
New Yorker essay based on a quick site visit. It is implicit in
contemporary arguments over the proper balance of Service Delivery
Assessments (fast turnaround studies) and more elaborate research. The
arguments often pit manager against technologist, substantive expert against
statistician, approximators against purists.

The problem is an ancient one, judging from Rabinovitch's (1973) little
monograph on statistical inference in medieval Jewish literature. In
discussing the idea of variability and sampling in the Talmud and Mishnah,
he describes a second century rabbinical argument over the appropriateness
of taking a systematic sample, of olives say, in the interest of judging
worth of crop for tithing, rather than an informal one, grabbing a
convenient handful and making a declaration about worth. One result of
debate appears to have been that "only in matters of lighter consequences,
for example, prohibitions that are of rabbinic but not biblical origin, may
one assume that perfunctory mixing gives an accurate sample" (p. 83).
Roughly speaking, the rabbis' judgment was that approximation is then
permissible for management purposes. It is not seemly if the demand comes
from a durable and important source, such as God.

In engineering, similar tension is reflected in other ways. Borg (1962),
for instance, suggests that structural engineering evolved into two camps
with less than cordial relations: engineering elasticity and strength of
materials. The first counted the theoreticians and mathematicians among its
members. The second comprised builders, crushers, and benders, engineers
with a taste for the concrete so to speak. Something of the same spirit
characterizes the split between experimental physics and its theoretical
sister, fluid dynamics, and other fields. The gap in statistics is wide
enough to concern the professional community, judging from the presidential
address at the International Statistical Institute's Warsaw meetings. And it
characterizes at least a few bitter struggles in economics during the
1930's (Strotz, 1978). Settling on the appropriate level of precision
or at least developing a rationale for it has been no easier for the
historical demographer, judging from Hollingsworth (1968). The fact that
the decision must rest on still earlier ones which might not be explicit
makes matters more difficult. Early Chinese censuses were confined largely
to cultivators, able bodied men, and taxpayers. "About the total population,
the sovereign did not wish to know" (Jaffe, 1947, p. 309).

The result in engineering at least sometimes takes a form similar to
one taken in the social sector, though it is more formal than the latter.
So, for example, standards for the classification of geodetic control have
been developed and pinned to functional uses of the information. Local
geodetic surveys are subject to less rigorous standards of precision than
are scientific studies and metropolitan area surveys. The idea of tolerance
bands of this sort characterizes most engineering disciplines of course, and
the product depends on the use to which the bearing, strut, detonation timer,
and so on is put. The depth to which the idea has penetrated in the social
sector is not great. It is present in any formal statistical design or
statistical power analysis. It is not evident in regulations that require
uniform evaluation methods at local and state level, though, and there appear
to have been no systematic treatments of the usefulness of broadening
tolerance limits, numerical or otherwise, in dealing with local enterprise.

Language

With customary style, John Kenneth Galbraith announced that "a certain
glib mastery (of the language of economics) is easy for the unlearned and
may even be aided by a mildly enfeebled intellect." The language, like
the vernacular of other social sciences, may invite seduction because it
deals with human affairs. But other aspects of scientific vernacular are
interesting, and the physical sciences are not entirely immune to the
problems it engenders. These features include the creation of new, official
meanings for existing words, causing confusion among the profane. If the
scientist, especially the social scientist, seeks to avoid the problem by
inventing new words, then lexicographic assault may follow. The confusion
mounts when the new words are popularized mistakenly in the press or by
public representatives.



To be sure, emerging areas of inquiry such as evaluation are usually
characterized by a good deal of lexical ambiguity. Glass and Elliott
(1980) rummage through contemporary papers to find evaluation defined as
applied sciences, systems management, decision theory, assessment of
progress toward goals, description or portrayal, and rational empiricism. In
our own investigations (Boruch & Cordray, 1980), we have interviewed a
director of research who announced his office did no evaluation, and his
boss, at the deputy secretary level, who announced that everything they
are responsible for is evaluation. We encountered Congressional staffers
who, in criticizing research or evaluation, fail to distinguish among
evaluation, research, development, and monitoring. We also talked to support
agency staff members who eschew the word evaluation entirely, preferring
instead simply to specify what question is answered by the process: Who
is served? How well are they served? How much does it cost? And what
are the effects of service? The phrases invented by academicians to clarify
are sometimes remarkably effective in consolidating a variety of related
themes under a single rubric. The less durable ones confuse, and it is
difficult "to praise famous coiners of new words and the sloppy
nomenclators that begat them" (Howard, 1979, p. 153) if the new ones are no
better than the old. The student is offered "formative" evaluation instead of
trouble-shooting or development, "summative" evaluation instead of
estimating program effects, and "meta-analysis" instead of synthesis or
combining estimates of effect. These and other new phrases have become patois
in much less than ten years. There are still many evaluators who try to
speak English, however.

The adoption of some of these words by politicians and journalists
has its parallel in the adoption from other disciplines of phrases that
are aurally attractive and equally vague. The phrase "representative
sampling," for instance, has no formal definition in mathematical or applied
statistics. In the nonscientific literature, for example, it is used to
imply that a given sample has a sort of seal of approval or a vaguely
scientific dignity. It is used to indicate a miniature of a population, to
suggest that extreme or varied units have been examined (Kruskal &
Mosteller, 1979a). In the nonstatistical scientific literature, it is used
as a less forbidding synonym for random sampling (Kruskal & Mosteller,
1979b). I expect that the word "experiment" is used in at least as many
ways. The word was appropriated from simple language by statisticians.
Unless told otherwise, the latter would expect the thing to be randomized,
and it is now used to lend an aura of scientific legitimacy to the process
of merely trying things out in laboratory and field settings.

There are also words that have become popular by mistake. Their
misuse is more pleasing than proper use, and in any event, it is hard for
the non-scientist to understand the correct definition. Howard's (1979)
catalog of words of this ilk is fascinating: "Quantum jumps" are not very
big ones as its users usually imply; rather, they are exceedingly small
transitions from one energy state to another. He suggests, incidentally, that
the term's abusers be made to walk the Planck, because they have got hold
of the wrong end of the quark. Feedback too has gotten appropriated
inappropriately. Geriatric implies health for the aged; its reference to
the long of tooth is incomplete.

The problem is not a new one of course. In social statistics, at
least, its recorded history dates at least to C. F. Pidgin's (1890-91)
efforts to popularize statistics in a seemly fashion. He objected
vigorously to the gobbledygook invented to praise, to obscure, and especially
to attack: "Now we have statistical roorbacks (supplementing the literal
variety) and neither the politicians nor the people understand them"
(p. 109). He also made a plea for simple summary and homely comparisons,
echoed 85 years later in the New York Times, e.g., "How does the risk the
FDA has moved against compare with the risk of breathing normal polluted
air in Manhattan?" (National Academy of Sciences, 1974, p. 27).

The lexical difficulties are tedious and unnecessary at times. In short,
they are normal. The puzzling part is why we do not expect them and have
no better ways to deal with them.

8. Use of Randomized Experiments in the Physical Sciences

A social science student suggested that because randomized tests are not
often used in the physical sciences and engineering, one ought to be
suspicious about their utility. The premise, rarity of experiments in the
area, is not unreasonable when one considers that few undergraduate science
courses stress the topic. But it does not reflect reality well.



Consider, for example, recent research on weather control. Over the
past 25 years, both randomized experimental tests and quasi-experiments
have been run to determine whether the introduction of silver iodide
crystals into the air will, under certain conditions, increase the
probability of precipitation. As in the social sciences, the incidence of
non-randomized experiments exceeds that of randomized trials. The work does
involve the physical sciences, since it draws on weather dynamics and
chemistry, as well as some knowledge of the natural sciences such as
atmospherics. This is also a nice illustration of a research endeavor in
which the distinction between physical sciences and natural sciences ceases
to be meaningful. So-called Grossversuch 3, which lasted seven years and
ended in 1963, is among the largest of weather experiments. The unit of
randomization in the experiments was a 24 hour period. Each period was
randomly assigned to an activity of seeding clouds or to an activity of not
seeding clouds, conditional on prior predictions about whether thunder
storms with hail were to be expected on the day in question. The experiments
were conducted on the southern slopes of the Alps in Switzerland and Italy
(see Neyman, 1977, for example).

Related efforts include the National Hail Research Experiment,
undertaken in Colorado and Nebraska in the 1972-1974 hail seasons (see Crow
et al., 1977). The experimental unit was the declared hail day, determined
on the basis of radar reflectivity data. "A random 50/50 choice to seed or
not to seed was applied only to the first day of a sequence of one or more
hail days; subsequent days in a sequence were given alternating treatments."
Treatment consisted of seeding clouds with silver iodide crystals using
rockets, pyrotechnic flares, etc. Response variables were hail size (smaller
circumference than control days), rain mass, and others. Apparently
seeding had no effect at the 10% level of significance.
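
The assignment rule quoted above can be sketched in a few lines of code. This is only an illustration of the rule as quoted (a random 50/50 choice on the first day of each sequence of hail days, alternating treatments thereafter); the function name, the input format, and the treatment labels are assumptions made for the example and are not taken from Crow et al.

```python
import random


def assign_hail_day_treatments(sequence_lengths, seed=None):
    """Assign 'seed' or 'no_seed' to each declared hail day.

    sequence_lengths is a hypothetical input: a list of integers, each
    giving the number of consecutive declared hail days in one sequence.
    The first day of every sequence gets a random 50/50 choice; the
    remaining days in that sequence alternate between the two treatments.
    """
    rng = random.Random(seed)
    schedule = []
    for length in sequence_lengths:
        first = rng.choice(["seed", "no_seed"])
        other = "no_seed" if first == "seed" else "seed"
        # Even positions repeat the first-day treatment, odd positions flip it.
        schedule.append([first if day % 2 == 0 else other for day in range(length)])
    return schedule


if __name__ == "__main__":
    # Three sequences of 1, 3, and 2 consecutive hail days.
    for sequence in assign_hail_day_treatments([1, 3, 2], seed=0):
        print(sequence)
```

Only the first day of a sequence is truly randomized under this rule; the later days are determined by it, which is one reason the sequence, rather than the single day, may be the more natural unit for analysis.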

The randomized trials described by Braham (1979) have been subjected
to intensive independent scrutiny and secondary analysis. The material
given earlier in Section 2 suggests that there are parallels between the
operational problems in meteorological experiments and experimental tests
of social programs. The Colorado trials described by Elliott et al.
(1978) exhibit similarities as well.



A second broad category of randomized experiments in the physical
sciences involves tests of strength of materials as a function of the shape
of the material, its composition, and other factors. Here again, the use of
randomized experiments is less frequent than the use of non-experiments.
However, one can find controlled studies of, for example, the effect of
fibre diameter on the fatigue strength of metal composites making up the
fibres. In civil engineering research, it is not difficult to find
randomized experiments in which the hardening time and characteristics of
the thickening process of cement are examined as a function of temperature,
pressure, and other physical properties of the cement. The unit of
randomization is a sample from a batch, several units being extracted from
each batch in order to make up replications. In chemistry, the light
fastness of dye bases has been explored using randomized experiments with
chemical composition of the dye base as a treatment variable, and a stable
element of the dye base as a blocking variable. The problems of designing
efficient experiments in chemical processing have been sufficient to produce
a substantial body of literature on optimization, by V. V. Fedorov at the
University of Moscow among others.
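
A minimal sketch of how samples drawn from batches might be randomized to treatments follows. It assumes, for illustration only, that each batch is treated as a block and that treatments are balanced within a batch; the batch labels, sample labels, and treatment names are hypothetical and do not come from the studies mentioned above.

```python
import random


def randomize_within_batches(batches, treatments, seed=None):
    """Randomly assign treatments to samples separately within each batch.

    batches maps a batch label to the list of sample labels drawn from it.
    Treatments are cycled so that each appears about equally often in a
    batch, and the order is then shuffled within that batch.
    """
    rng = random.Random(seed)
    plan = {}
    for batch_id, samples in batches.items():
        labels = [treatments[i % len(treatments)] for i in range(len(samples))]
        rng.shuffle(labels)  # random order of treatments inside the block
        plan[batch_id] = dict(zip(samples, labels))
    return plan


if __name__ == "__main__":
    batches = {
        "batch_A": ["A1", "A2", "A3", "A4"],
        "batch_B": ["B1", "B2", "B3", "B4"],
    }
    plan = randomize_within_batches(batches, ["low_temp", "high_temp"], seed=1)
    for batch_id, assignment in plan.items():
        print(batch_id, assignment)
```

Randomizing within each batch keeps batch-to-batch differences from being confused with treatment effects, while the several samples per batch supply the replications.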

More generally, it is not difficult to find major technical reports on
randomized experimental designs in the physical sciences and engineering,
issued by, among others, the National Bureau of Standards and the General
Electric Corporation. Each has research laboratories which have produced
reports on fractional factorial designs. The Office of Aerospace Research
at Wright Patterson Air Force Base in Ohio has issued important reports on
complex fractional factorial designs.

Finally, there are at least a half dozen textbooks available on
experimental design in the engineering sciences, especially industrial
engineering and related areas. Books by Brownlee, Daniel (1976), Davies
(1971), and Chew (1958) are notable. Still more generally, a sizable number
of individuals who've made distinctive contributions to applied statistics
over the past 20 years have done so through their involvement with applied
research in the physical sciences. This includes, for example, G. E. P. Box
(Hunter & Box, 1965) and H. Scheffe (1958), as well as individuals who are
better known for their work in agricultural research, such as Youden and
Kempthorne. In fact, some major areas of experimental design have grown
primarily out of work in the industrial and engineering sector: fractional
factorials, weighing designs, and specialized designs for understanding
chemical mixtures and the influence of externally manipulated factors
around them. The work is reported regularly in journals such as
Technometrics (e.g., Webb, 1973), and Biometrics (e.g., Davies & Hay, 1950).



Footnotes 



I am grateful to the National Science Foundation (DAR 7820374) for
support of work on evaluative methods in the social sciences, and to the
National Institute of Education (NIE-Q-79-0128) for support of work on
evaluation in education. Portions of this paper have been presented at the
University of Jerusalem in June 1980 and at the U.S. General Accounting
Office. William H. Kruskal kindly provided suggestions on an earlier draft.

The failure of a suspension bridge over the Maine River (France) in
1850 also provides a nice illustration of malformed federal regulations.
The amplified vibration was caused by the Eleventh Light Infantry's cadence
march. All perished in the ensuing bridge collapse. Daumas and Gille
(1968/1979) report that general orders were issued that infantry break step
on bridges as a consequence. Because bridge type was not specified,
infantry broke step on all bridges.

Samuel Johnson did not pussyfoot around palmistry either. His dictionary
defined it as "the cheat of foretelling the future by the lines of
the palm" (Johnson, 1755/1979).

The clever exploitation of casualty statistics is occasionally matched
by clever contemporary reporters. For instance, Henry Kissinger's
autobiography claims that his decisions about invading Cambodia were
justified partly on account of the (anticipated) continuous drop in
American casualty rates following invasion. William Shawcross neatly
assaults K. in a footnote, citing the cyclical character of casualties
in Viet Nam and the declining secular trend underlying the cycle as a
competing, more plausible explanation for the decline, a trend attributable
mainly to the gradual withdrawal of troops from combat areas
(Harper's, November 1980, pp. 35-44, 89-97).


The effort to secure the original specimens appears to have been
vigorous. The denouement seems not to have impaired much the installation
of Lysenko's beliefs about inheritance of acquired characteristics in
Russia during the 1950's (Zirkle, 1954).

Excerpted from Boruch and Cecil (1979). 